The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether or not a client will repay a loan. To make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
Kaggle is a data science competition platform that hosts a large number of datasets. In the past, submitting results was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle from the command line. E.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; a complete submission takes less than 15 minutes. Download your kaggle.json API token from your Kaggle account page and place kaggle.json in the right place, ~/.kaggle/kaggle.json. For more detailed information on setting up the Kaggle API, see here and here.
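As a quick sanity check that the token is where the CLI expects it, a minimal sketch (assuming the default token location, ~/.kaggle/kaggle.json under your home directory):

# Check that the Kaggle API token is in the default location.
from pathlib import Path

token = Path.home() / ".kaggle" / "kaggle.json"
print(token, "exists:", token.exists())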
!pip install kaggle
Requirement already satisfied: kaggle in c:\users\jgdsh\appdata\local\programs\python\python310\lib\site-packages (1.5.16) (similar already-satisfied messages for its dependencies six, certifi, python-dateutil, requests, tqdm, python-slugify, urllib3, bleach, webencodings, text-unidecode, charset-normalizer, idna, and colorama omitted)
!pwd
'pwd' is not recognized as an internal or external command, operable program or batch file.
(Unix commands such as pwd, ls, chmod, and mkdir -p fail like this when the notebook runs on a local Windows machine; they work as intended in Colab or any Unix-like environment.)
!ls -l ~/.kaggle/kaggle.json
# We use this code when working with the Colab environment
json_file_not_exists = True  # change this to False if you already have your json from Kaggle in place
if json_file_not_exists:
    from google.colab import files  # Colab-only helper for uploading local files
    files.upload()  # select kaggle.json from your machine
    !mkdir ~/.kaggle
    !cp kaggle.json ~/.kaggle
Saving kaggle.json to kaggle.json
!chmod 600 ~/.kaggle/kaggle.json  # the Kaggle CLI warns if the token is readable by other users
! kaggle competitions files home-credit-default-risk
name                                 size  creationDate
----------------------------------  -----  -------------------
sample_submission.csv               524KB  2019-12-11 02:55:35
credit_card_balance.csv             405MB  2019-12-11 02:55:35
installments_payments.csv           690MB  2019-12-11 02:55:35
HomeCredit_columns_description.csv   37KB  2019-12-11 02:55:35
bureau.csv                          162MB  2019-12-11 02:55:35
application_test.csv                 25MB  2019-12-11 02:55:35
POS_CASH_balance.csv                375MB  2019-12-11 02:55:35
previous_application.csv            386MB  2019-12-11 02:55:35
bureau_balance.csv                  358MB  2019-12-11 02:55:35
application_train.csv               158MB  2019-12-11 02:55:35
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
The HomeCredit_columns_description.csv file acts as a data dictionary.
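Once the imports and DATA_DIR defined below are in place, the dictionary can be queried programmatically; a small sketch (the Table/Row/Description column names come from the dictionary's own layout, and the file usually needs a non-UTF-8 encoding such as latin-1):

# Look up the description of a column in the data dictionary.
col_desc = pd.read_csv(os.path.join(DATA_DIR, 'HomeCredit_columns_description.csv'),
                       encoding='latin-1')
col_desc.loc[col_desc['Row'] == 'AMT_CREDIT', ['Table', 'Row', 'Description']]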
There are 7 different sources of data:
name                    [       rows, cols]  MegaBytes
----------------------  -------------------  ---------
application_train       [    307,511,  122]  158MB
application_test        [     48,744,  121]   25MB
bureau                  [  1,716,428,   17]  162MB
bureau_balance          [ 27,299,925,    3]  358MB
credit_card_balance     [  3,840,312,   23]  405MB
installments_payments   [ 13,605,401,    8]  690MB
previous_application    [  1,670,214,   37]  386MB
POS_CASH_balance        [ 10,001,358,    8]  375MB
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and the data dictionary and unzip them using either of the following approaches: (1) the Download button on the competition's Data webpage, unzipping the archive into DATA_DIR, or (2) the Kaggle CLI download shown below.
# We use this code when working with the Colab environment
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
DATA_DIR = "../Data/home-credit-default-risk" #same level as course repo in the data directory
!mkdir -p $DATA_DIR
The syntax of the command is incorrect.
!ls -l $DATA_DIR
'ls' is not recognized as an internal or external command, operable program or batch file.
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
home-credit-default-risk.zip: Skipping, found more recently modified local copy (use --force to force download)
!pwd
/content
!ls -l $DATA_DIR
total 704708 -rw-r--r-- 1 root root 721616255 Dec 11 2019 home-credit-default-risk.zip
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = False  # set to True on the first run, before the archive has been extracted
if unzippingReq:
    # extractall() extracts all members from the archive; its path argument selects the target directory
    with zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r') as zip_ref:
        zip_ref.extractall(DATA_DIR)
ls -l ../../../Data/home-credit-default-risk/application_train.csv
Invalid switch - "..".
def load_data(in_path, name):
    # Read a CSV, report its shape and schema, and show the first rows.
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    df.info()  # info() prints directly and returns None, so wrapping it in print() just adds a stray "None"
    display(df.head(5))  # display() is available by default in Jupyter/IPython
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
(307511, 122)
DATA_DIR
'../Data/home-credit-default-risk'
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: gender, income, family status, education, and so on. Note that application_train has 122 columns while application_test has 121; the extra column is the TARGET label we are asked to predict.
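Before modeling, it is worth checking how imbalanced TARGET is; a quick sketch:

# Share of each TARGET class in application_train (1 = payment difficulties, 0 = repaid on time).
datasets['application_train']['TARGET'].value_counts(normalize=True)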
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance",
            "installments_payments", "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB None
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (4728516, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 4728516 entries, 0 to 4728515 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE float64 2 STATUS object dtypes: float64(1), int64(1), object(1) memory usage: 108.2+ MB None
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0.0 | C |
| 1 | 5715448 | -1.0 | C |
| 2 | 5715448 | -2.0 | C |
| 3 | 5715448 | -3.0 | C |
| 4 | 5715448 | -4.0 | C |
credit_card_balance: shape is (3840312, 23) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB None
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB None
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (10001358, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: total: 18.6 s Wall time: 31 s
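The info() printouts above show how large some of these frames are (installments_payments alone is about 830 MB). If memory is tight, the usual remedy is to downcast numeric dtypes; a sketch, assuming downstream code tolerates 32-bit (or smaller) precision:

# Shrink memory by downcasting numeric columns where the value ranges permit.
def downcast_numerics(df):
    for col in df.select_dtypes(include='integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include='float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df

# e.g. datasets['installments_payments'] = downcast_numerics(datasets['installments_payments'])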
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train : [ 307,511, 122] dataset application_test : [ 48,744, 121] dataset bureau : [ 1,716,428, 17] dataset bureau_balance : [ 4,728,516, 3] dataset credit_card_balance : [ 3,840,312, 23] dataset installments_payments : [ 13,605,401, 8] dataset previous_application : [ 1,670,214, 37] dataset POS_CASH_balance : [ 10,001,358, 8]
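The ID columns visible in the schemas above are what tie the tables together: SK_ID_CURR links every table back to the application rows, SK_ID_BUREAU links bureau to bureau_balance, and SK_ID_PREV links previous_application to POS_CASH_balance, installments_payments, and credit_card_balance. As a minimal illustration of joining on these keys (the aggregate name BUREAU_LOAN_COUNT is mine, not part of the dataset):

# Count each client's credit-bureau records and attach the count to the train table.
bureau_counts = (datasets['bureau']
                 .groupby('SK_ID_CURR')['SK_ID_BUREAU']
                 .count()
                 .rename('BUREAU_LOAN_COUNT'))
train_with_counts = datasets['application_train'].merge(bureau_counts, on='SK_ID_CURR', how='left')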
def plot_missing_data(df, x, y):
    # For the named dataset, draw one normalized stacked bar per column showing the
    # fraction of missing vs. present values (isna() melted into long form).
    g = sns.displot(
        data=datasets[df].isna().melt(value_name="missing"),
        y="variable",
        hue="missing",
        multiple="fill",
        aspect=1.25
    )
    g.fig.set_figwidth(x)  # x, y: figure width and height in inches
    g.fig.set_figheight(y)
datasets["application_train"].info()
datasets["application_train"].columns
datasets["application_train"].dtypes
datasets["application_train"].describe() #numerical only features
datasets["application_train"].describe(include='all') #look at all categorical and numerical
datasets["application_train"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.000000 | -0.002108 | -0.001129 | -0.001820 | -0.000343 | -0.000433 | -0.000232 | 0.000849 | -0.001500 | 0.001366 | ... | 0.000509 | 0.000167 | 0.001073 | 0.000282 | -0.002672 | -0.002193 | 0.002099 | 0.000485 | 0.001025 | 0.004659 |
| TARGET | -0.002108 | 1.000000 | 0.019187 | -0.003982 | -0.030369 | -0.012817 | -0.039645 | -0.037227 | 0.078239 | -0.044932 | ... | -0.007952 | -0.001358 | 0.000215 | 0.003709 | 0.000930 | 0.002704 | 0.000788 | -0.012462 | -0.002022 | 0.019930 |
| CNT_CHILDREN | -0.001129 | 0.019187 | 1.000000 | 0.012882 | 0.002145 | 0.021374 | -0.001827 | -0.025573 | 0.330938 | -0.239818 | ... | 0.004031 | 0.000864 | 0.000988 | -0.002450 | -0.000410 | -0.000366 | -0.002436 | -0.010808 | -0.007836 | -0.041550 |
| AMT_INCOME_TOTAL | -0.001820 | -0.003982 | 0.012882 | 1.000000 | 0.156870 | 0.191657 | 0.159610 | 0.074796 | 0.027261 | -0.064223 | ... | 0.003130 | 0.002408 | 0.000242 | -0.000589 | 0.000709 | 0.002944 | 0.002387 | 0.024700 | 0.004859 | 0.011690 |
| AMT_CREDIT | -0.000343 | -0.030369 | 0.002145 | 0.156870 | 1.000000 | 0.770138 | 0.986968 | 0.099738 | -0.055436 | -0.066838 | ... | 0.034329 | 0.021082 | 0.031023 | -0.016148 | -0.003906 | 0.004238 | -0.001275 | 0.054451 | 0.015925 | -0.048448 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_DAY | -0.002193 | 0.002704 | -0.000366 | 0.002944 | 0.004238 | 0.002185 | 0.004677 | 0.001399 | 0.002255 | 0.000472 | ... | 0.013281 | 0.001126 | -0.000120 | -0.001130 | 0.230374 | 1.000000 | 0.217412 | -0.005258 | -0.004416 | -0.003355 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 0.002099 | 0.000788 | -0.002436 | 0.002387 | -0.001275 | 0.013881 | -0.001007 | -0.002149 | -0.001336 | 0.003072 | ... | -0.004640 | -0.001275 | -0.001770 | 0.000081 | 0.004706 | 0.217412 | 1.000000 | -0.014096 | -0.015115 | 0.018917 |
| AMT_REQ_CREDIT_BUREAU_MON | 0.000485 | -0.012462 | -0.010808 | 0.024700 | 0.054451 | 0.039148 | 0.056422 | 0.078607 | 0.001372 | -0.034457 | ... | -0.001565 | -0.002729 | 0.001285 | -0.003612 | -0.000018 | -0.005258 | -0.014096 | 1.000000 | -0.007789 | -0.004975 |
| AMT_REQ_CREDIT_BUREAU_QRT | 0.001025 | -0.002022 | -0.007836 | 0.004859 | 0.015925 | 0.010124 | 0.016432 | -0.001279 | -0.011799 | 0.015345 | ... | -0.005125 | -0.001575 | -0.001010 | -0.002004 | -0.002716 | -0.004416 | -0.015115 | -0.007789 | 1.000000 | 0.076208 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 0.004659 | 0.019930 | -0.041550 | 0.011690 | -0.048448 | -0.011320 | -0.050998 | 0.001003 | -0.071983 | 0.049988 | ... | -0.047432 | -0.007009 | -0.012126 | -0.005457 | -0.004597 | -0.003355 | 0.018917 | -0.004975 | 0.076208 | 1.000000 |
106 rows × 106 columns
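Note that .corr() here silently drops the 16 object (categorical) columns, which is why the matrix is 106×106 rather than 122×122. Pandas 2.0 and later raises an error instead of dropping them, so there you must be explicit:

# With pandas >= 2.0, non-numeric columns have to be excluded explicitly:
datasets["application_train"].corr(numeric_only=True)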
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
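The percent/count recipe above is repeated verbatim for every table below, so it could be wrapped in a small helper; a sketch (the name missing_summary is mine):

# Percent and count of missing values per column, sorted descending.
def missing_summary(name, label="Missing Count"):
    df = datasets[name]
    percent = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
    counts = df.isna().sum().sort_values(ascending=False)
    return pd.concat([percent, counts], axis=1, keys=['Percent', label])

# e.g. missing_summary('application_train', 'Train Missing Count').head(20)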
plot_missing_data("application_train", 18, 20)
datasets["application_test"].info()
datasets["application_test"].columns
datasets["application_test"].dtypes
datasets["application_test"].describe() #numerical only features
datasets["application_test"].describe(include='all')
datasets["application_test"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.000000 | 0.000635 | 0.001278 | 0.005014 | 0.007112 | 0.005097 | 0.003324 | 0.002325 | -0.000845 | 0.001032 | ... | -0.006286 | NaN | NaN | NaN | -0.000307 | 0.001083 | 0.001178 | 0.000430 | -0.002092 | 0.003457 |
| CNT_CHILDREN | 0.000635 | 1.000000 | 0.038962 | 0.027840 | 0.056770 | 0.025507 | -0.015231 | 0.317877 | -0.238319 | 0.175054 | ... | -0.000862 | NaN | NaN | NaN | 0.006362 | 0.001539 | 0.007523 | -0.008337 | 0.029006 | -0.039265 |
| AMT_INCOME_TOTAL | 0.001278 | 0.038962 | 1.000000 | 0.396572 | 0.457833 | 0.401995 | 0.199773 | 0.054400 | -0.154619 | 0.067973 | ... | -0.006624 | NaN | NaN | NaN | 0.010227 | 0.004989 | -0.002867 | 0.008691 | 0.007410 | 0.003281 |
| AMT_CREDIT | 0.005014 | 0.027840 | 0.396572 | 1.000000 | 0.777733 | 0.988056 | 0.135694 | -0.046169 | -0.083483 | 0.030740 | ... | -0.000197 | NaN | NaN | NaN | -0.001092 | 0.004882 | 0.002904 | -0.000156 | -0.007750 | -0.034533 |
| AMT_ANNUITY | 0.007112 | 0.056770 | 0.457833 | 0.777733 | 1.000000 | 0.787033 | 0.150864 | 0.047859 | -0.137772 | 0.064450 | ... | -0.010762 | NaN | NaN | NaN | 0.008428 | 0.006681 | 0.003085 | 0.005695 | 0.012443 | -0.044901 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_DAY | 0.001083 | 0.001539 | 0.004989 | 0.004882 | 0.006681 | 0.004865 | -0.011773 | -0.000386 | -0.000785 | -0.000152 | ... | -0.001515 | NaN | NaN | NaN | 0.151506 | 1.000000 | 0.035567 | 0.005877 | 0.006509 | 0.002002 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 0.001178 | 0.007523 | -0.002867 | 0.002904 | 0.003085 | 0.003358 | -0.008321 | 0.012422 | -0.014058 | 0.008692 | ... | 0.009205 | NaN | NaN | NaN | -0.002345 | 0.035567 | 1.000000 | 0.054291 | 0.024957 | -0.000252 |
| AMT_REQ_CREDIT_BUREAU_MON | 0.000430 | -0.008337 | 0.008691 | -0.000156 | 0.005695 | -0.000254 | 0.000105 | 0.014094 | -0.013891 | 0.007414 | ... | -0.003248 | NaN | NaN | NaN | 0.023510 | 0.005877 | 0.054291 | 1.000000 | 0.005446 | 0.026118 |
| AMT_REQ_CREDIT_BUREAU_QRT | -0.002092 | 0.029006 | 0.007410 | -0.007750 | 0.012443 | -0.008490 | -0.026650 | 0.088752 | -0.044351 | 0.046011 | ... | -0.010480 | NaN | NaN | NaN | -0.003075 | 0.006509 | 0.024957 | 0.005446 | 1.000000 | -0.013081 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 0.003457 | -0.039265 | 0.003281 | -0.034533 | -0.044901 | -0.036227 | 0.001015 | -0.095551 | 0.064698 | -0.036887 | ... | -0.009864 | NaN | NaN | NaN | 0.011938 | 0.002002 | -0.000252 | 0.026118 | -0.013081 | 1.000000 |
105 rows × 105 columns
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
plot_missing_data("application_test", 18, 20)
datasets["bureau"].info()
datasets["bureau"].columns
datasets["bureau"].dtypes
datasets["bureau"].describe()
datasets["bureau"].describe(include="all")
datasets["bureau"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB
| SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.000000 | 0.000135 | 0.000266 | 0.000283 | 0.000456 | -0.000648 | 0.001329 | -0.000388 | 0.001179 | -0.000790 | -0.000304 | -0.000014 | 0.000510 | -0.002727 |
| SK_ID_BUREAU | 0.000135 | 1.000000 | 0.013015 | -0.002628 | 0.009107 | 0.017890 | 0.002290 | -0.000740 | 0.007962 | 0.005732 | -0.003986 | -0.000499 | 0.019398 | 0.001799 |
| DAYS_CREDIT | 0.000266 | 0.013015 | 1.000000 | -0.027266 | 0.225682 | 0.875359 | -0.014724 | -0.030460 | 0.050883 | 0.135397 | 0.025140 | -0.000383 | 0.688771 | 0.005676 |
| CREDIT_DAY_OVERDUE | 0.000283 | -0.002628 | -0.027266 | 1.000000 | -0.007352 | -0.008637 | 0.001249 | 0.002756 | -0.003292 | -0.002355 | -0.000345 | 0.090951 | -0.018461 | -0.000339 |
| DAYS_CREDIT_ENDDATE | 0.000456 | 0.009107 | 0.225682 | -0.007352 | 1.000000 | 0.248825 | 0.000577 | 0.113683 | 0.055424 | 0.081298 | 0.095421 | 0.001077 | 0.248525 | 0.000475 |
| DAYS_ENDDATE_FACT | -0.000648 | 0.017890 | 0.875359 | -0.008637 | 0.248825 | 1.000000 | 0.000999 | 0.012017 | 0.059096 | 0.019609 | 0.019476 | -0.000332 | 0.751294 | 0.006274 |
| AMT_CREDIT_MAX_OVERDUE | 0.001329 | 0.002290 | -0.014724 | 0.001249 | 0.000577 | 0.000999 | 1.000000 | 0.001523 | 0.081663 | 0.014007 | -0.000112 | 0.015036 | -0.000749 | 0.001578 |
| CNT_CREDIT_PROLONG | -0.000388 | -0.000740 | -0.030460 | 0.002756 | 0.113683 | 0.012017 | 0.001523 | 1.000000 | -0.008345 | -0.001366 | 0.073805 | 0.000002 | 0.017864 | -0.000465 |
| AMT_CREDIT_SUM | 0.001179 | 0.007962 | 0.050883 | -0.003292 | 0.055424 | 0.059096 | 0.081663 | -0.008345 | 1.000000 | 0.683419 | 0.003756 | 0.006342 | 0.104629 | 0.049146 |
| AMT_CREDIT_SUM_DEBT | -0.000790 | 0.005732 | 0.135397 | -0.002355 | 0.081298 | 0.019609 | 0.014007 | -0.001366 | 0.683419 | 1.000000 | -0.018215 | 0.008046 | 0.141235 | 0.025507 |
| AMT_CREDIT_SUM_LIMIT | -0.000304 | -0.003986 | 0.025140 | -0.000345 | 0.095421 | 0.019476 | -0.000112 | 0.073805 | 0.003756 | -0.018215 | 1.000000 | -0.000687 | 0.046028 | 0.004392 |
| AMT_CREDIT_SUM_OVERDUE | -0.000014 | -0.000499 | -0.000383 | 0.090951 | 0.001077 | -0.000332 | 0.015036 | 0.000002 | 0.006342 | 0.008046 | -0.000687 | 1.000000 | 0.003528 | 0.000344 |
| DAYS_CREDIT_UPDATE | 0.000510 | 0.019398 | 0.688771 | -0.018461 | 0.248525 | 0.751294 | -0.000749 | 0.017864 | 0.104629 | 0.141235 | 0.046028 | 0.003528 | 1.000000 | 0.008418 |
| AMT_ANNUITY | -0.002727 | 0.001799 | 0.005676 | -0.000339 | 0.000475 | 0.006274 | 0.001578 | -0.000465 | 0.049146 | 0.025507 | 0.004392 | 0.000344 | 0.008418 | 1.000000 |
percent = (datasets["bureau"].isnull().sum()/datasets["bureau"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau"].isna().sum().sort_values(ascending = False)
missing_bureau_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Bureau Missing Count"])
missing_bureau_data.head(20)
plot_missing_data("bureau",18,20)
datasets["bureau_balance"].info()
datasets["bureau_balance"].columns
datasets["bureau_balance"].dtypes
datasets["bureau_balance"].describe()
datasets["bureau_balance"].describe(include='all')
datasets["bureau_balance"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4728516 entries, 0 to 4728515 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE float64 2 STATUS object dtypes: float64(1), int64(1), object(1) memory usage: 108.2+ MB
| SK_ID_BUREAU | MONTHS_BALANCE | |
|---|---|---|
| SK_ID_BUREAU | 1.000000 | -0.013447 |
| MONTHS_BALANCE | -0.013447 | 1.000000 |
percent = (datasets["bureau_balance"].isnull().sum()/datasets["bureau_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau_balance"].isna().sum().sort_values(ascending = False)
missing_bureau_balance_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Bureau Balance Missing Count"])
missing_bureau_balance_data.head(20)
plot_missing_data("bureau_balance",18,20)
datasets["POS_CASH_balance"].info()
datasets["POS_CASH_balance"].columns
datasets["POS_CASH_balance"].dtypes
datasets["POS_CASH_balance"].describe()
datasets["POS_CASH_balance"].describe(include='all')
datasets["POS_CASH_balance"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 1.000000 | -0.000336 | 0.001835 | 0.003820 | 0.003679 | -0.000487 | 0.004848 |
| SK_ID_CURR | -0.000336 | 1.000000 | 0.000404 | 0.000144 | -0.000559 | 0.003118 | 0.001948 |
| MONTHS_BALANCE | 0.001835 | 0.000404 | 1.000000 | 0.336163 | 0.271595 | -0.018939 | -0.000381 |
| CNT_INSTALMENT | 0.003820 | 0.000144 | 0.336163 | 1.000000 | 0.871276 | -0.060803 | -0.014154 |
| CNT_INSTALMENT_FUTURE | 0.003679 | -0.000559 | 0.271595 | 0.871276 | 1.000000 | -0.082004 | -0.017436 |
| SK_DPD | -0.000487 | 0.003118 | -0.018939 | -0.060803 | -0.082004 | 1.000000 | 0.245782 |
| SK_DPD_DEF | 0.004848 | 0.001948 | -0.000381 | -0.014154 | -0.017436 | 0.245782 | 1.000000 |
percent = (datasets["POS_CASH_balance"].isnull().sum()/datasets["POS_CASH_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["POS_CASH_balance"].isna().sum().sort_values(ascending = False)
missing_pos_cash_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "POS_CASH Missing Count"])
missing_pos_cash_data.head(20)
plot_missing_data("POS_CASH_balance",18,20)
datasets["credit_card_balance"].info()
datasets["credit_card_balance"].columns
datasets["credit_card_balance"].dtypes
datasets["credit_card_balance"].describe()
datasets["credit_card_balance"].describe(include="all")
datasets["credit_card_balance"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 1.000000 | 0.004723 | 0.003670 | 0.005046 | 0.006631 | 0.004342 | 0.002624 | -0.000160 | 0.001721 | 0.006460 | ... | 0.005140 | 0.005035 | 0.005032 | 0.002821 | 0.000367 | -0.001412 | 0.000809 | -0.007219 | -0.001786 | 0.001973 |
| SK_ID_CURR | 0.004723 | 1.000000 | 0.001696 | 0.003510 | 0.005991 | 0.000814 | 0.000708 | 0.000958 | -0.000786 | 0.003300 | ... | 0.003589 | 0.003518 | 0.003524 | 0.002082 | 0.002654 | -0.000131 | 0.002135 | -0.000581 | -0.000962 | 0.001519 |
| MONTHS_BALANCE | 0.003670 | 0.001696 | 1.000000 | 0.014558 | 0.199900 | 0.036802 | 0.065527 | 0.000405 | 0.118146 | -0.087529 | ... | 0.016266 | 0.013172 | 0.013084 | 0.002536 | 0.113321 | -0.026192 | 0.160207 | -0.008620 | 0.039434 | 0.001659 |
| AMT_BALANCE | 0.005046 | 0.003510 | 0.014558 | 1.000000 | 0.489386 | 0.283551 | 0.336965 | 0.065366 | 0.169449 | 0.896728 | ... | 0.999720 | 0.999917 | 0.999897 | 0.309968 | 0.259184 | 0.046563 | 0.155553 | 0.005009 | -0.046988 | 0.013009 |
| AMT_CREDIT_LIMIT_ACTUAL | 0.006631 | 0.005991 | 0.199900 | 0.489386 | 1.000000 | 0.247219 | 0.263093 | 0.050579 | 0.234976 | 0.467620 | ... | 0.490445 | 0.488641 | 0.488598 | 0.221808 | 0.204237 | 0.030051 | 0.202868 | -0.157269 | -0.038791 | -0.002236 |
| AMT_DRAWINGS_ATM_CURRENT | 0.004342 | 0.000814 | 0.036802 | 0.283551 | 0.247219 | 1.000000 | 0.800190 | 0.017899 | 0.078971 | 0.094824 | ... | 0.280402 | 0.278290 | 0.278260 | 0.732907 | 0.298173 | 0.013254 | 0.076083 | -0.103721 | -0.022044 | -0.003360 |
| AMT_DRAWINGS_CURRENT | 0.002624 | 0.000708 | 0.065527 | 0.336965 | 0.263093 | 0.800190 | 1.000000 | 0.236297 | 0.615591 | 0.124469 | ... | 0.337117 | 0.332831 | 0.332796 | 0.594361 | 0.523016 | 0.140032 | 0.359001 | -0.093491 | -0.020606 | -0.003137 |
| AMT_DRAWINGS_OTHER_CURRENT | -0.000160 | 0.000958 | 0.000405 | 0.065366 | 0.050579 | 0.017899 | 0.236297 | 1.000000 | 0.007382 | 0.002158 | ... | 0.066108 | 0.064929 | 0.064923 | 0.012008 | 0.021271 | 0.575295 | 0.004458 | -0.023013 | -0.003693 | -0.000568 |
| AMT_DRAWINGS_POS_CURRENT | 0.001721 | -0.000786 | 0.118146 | 0.169449 | 0.234976 | 0.078971 | 0.615591 | 0.007382 | 1.000000 | 0.063562 | ... | 0.173745 | 0.168974 | 0.168950 | 0.072658 | 0.520123 | 0.007620 | 0.542556 | -0.106813 | -0.015040 | -0.002384 |
| AMT_INST_MIN_REGULARITY | 0.006460 | 0.003300 | -0.087529 | 0.896728 | 0.467620 | 0.094824 | 0.124469 | 0.002158 | 0.063562 | 1.000000 | ... | 0.896030 | 0.897617 | 0.897587 | 0.170616 | 0.148262 | 0.014360 | 0.086729 | 0.064320 | -0.061484 | -0.005715 |
| AMT_PAYMENT_CURRENT | 0.003472 | 0.000127 | 0.076355 | 0.143934 | 0.308294 | 0.189075 | 0.337343 | 0.034577 | 0.321055 | 0.333909 | ... | 0.143162 | 0.142389 | 0.142371 | 0.142935 | 0.223483 | 0.017246 | 0.195074 | -0.079266 | -0.030222 | -0.004340 |
| AMT_PAYMENT_TOTAL_CURRENT | 0.001641 | 0.000784 | 0.035614 | 0.151349 | 0.226570 | 0.159186 | 0.305726 | 0.025123 | 0.301760 | 0.335201 | ... | 0.149936 | 0.149926 | 0.149914 | 0.125655 | 0.217857 | 0.014041 | 0.183973 | -0.023156 | -0.022475 | -0.003443 |
| AMT_RECEIVABLE_PRINCIPAL | 0.005140 | 0.003589 | 0.016266 | 0.999720 | 0.490445 | 0.280402 | 0.337117 | 0.066108 | 0.173745 | 0.896030 | ... | 1.000000 | 0.999727 | 0.999702 | 0.302627 | 0.258848 | 0.046543 | 0.157723 | 0.003664 | -0.048290 | 0.006780 |
| AMT_RECIVABLE | 0.005035 | 0.003518 | 0.013172 | 0.999917 | 0.488641 | 0.278290 | 0.332831 | 0.064929 | 0.168974 | 0.897617 | ... | 0.999727 | 1.000000 | 0.999995 | 0.303571 | 0.256347 | 0.046118 | 0.154507 | 0.005935 | -0.046434 | 0.015466 |
| AMT_TOTAL_RECEIVABLE | 0.005032 | 0.003524 | 0.013084 | 0.999897 | 0.488598 | 0.278260 | 0.332796 | 0.064923 | 0.168950 | 0.897587 | ... | 0.999702 | 0.999995 | 1.000000 | 0.303542 | 0.256317 | 0.046113 | 0.154481 | 0.005959 | -0.046047 | 0.017243 |
| CNT_DRAWINGS_ATM_CURRENT | 0.002821 | 0.002082 | 0.002536 | 0.309968 | 0.221808 | 0.732907 | 0.594361 | 0.012008 | 0.072658 | 0.170616 | ... | 0.302627 | 0.303571 | 0.303542 | 1.000000 | 0.410907 | 0.012730 | 0.108388 | -0.103403 | -0.029395 | -0.004277 |
| CNT_DRAWINGS_CURRENT | 0.000367 | 0.002654 | 0.113321 | 0.259184 | 0.204237 | 0.298173 | 0.523016 | 0.021271 | 0.520123 | 0.148262 | ... | 0.258848 | 0.256347 | 0.256317 | 0.410907 | 1.000000 | 0.033940 | 0.950546 | -0.099186 | -0.020786 | -0.003106 |
| CNT_DRAWINGS_OTHER_CURRENT | -0.001412 | -0.000131 | -0.026192 | 0.046563 | 0.030051 | 0.013254 | 0.140032 | 0.575295 | 0.007620 | 0.014360 | ... | 0.046543 | 0.046118 | 0.046113 | 0.012730 | 0.033940 | 1.000000 | 0.007203 | -0.021632 | -0.006083 | -0.000895 |
| CNT_DRAWINGS_POS_CURRENT | 0.000809 | 0.002135 | 0.160207 | 0.155553 | 0.202868 | 0.076083 | 0.359001 | 0.004458 | 0.542556 | 0.086729 | ... | 0.157723 | 0.154507 | 0.154481 | 0.108388 | 0.950546 | 0.007203 | 1.000000 | -0.129338 | -0.018212 | -0.002840 |
| CNT_INSTALMENT_MATURE_CUM | -0.007219 | -0.000581 | -0.008620 | 0.005009 | -0.157269 | -0.103721 | -0.093491 | -0.023013 | -0.106813 | 0.064320 | ... | 0.003664 | 0.005935 | 0.005959 | -0.103403 | -0.099186 | -0.021632 | -0.129338 | 1.000000 | 0.059654 | 0.002156 |
| SK_DPD | -0.001786 | -0.000962 | 0.039434 | -0.046988 | -0.038791 | -0.022044 | -0.020606 | -0.003693 | -0.015040 | -0.061484 | ... | -0.048290 | -0.046434 | -0.046047 | -0.029395 | -0.020786 | -0.006083 | -0.018212 | 0.059654 | 1.000000 | 0.218950 |
| SK_DPD_DEF | 0.001973 | 0.001519 | 0.001659 | 0.013009 | -0.002236 | -0.003360 | -0.003137 | -0.000568 | -0.002384 | -0.005715 | ... | 0.006780 | 0.015466 | 0.017243 | -0.004277 | -0.003106 | -0.000895 | -0.002840 | 0.002156 | 0.218950 | 1.000000 |
22 rows × 22 columns
percent = (datasets["credit_card_balance"].isnull().sum()/datasets["credit_card_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["credit_card_balance"].isna().sum().sort_values(ascending = False)
missing_credit_card_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Credit Card Missing Count"])
missing_credit_card_data.head(20)
plot_missing_data('credit_card_balance',18,20)
datasets["previous_application"].info()
datasets["previous_application"].columns
datasets["previous_application"].dtypes
datasets["previous_application"].describe()
datasets["previous_application"].describe(include='all')
datasets["previous_application"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | ... | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 1.000000 | -0.000321 | 0.011459 | 0.003302 | 0.003659 | -0.001313 | 0.015293 | -0.002652 | -0.002828 | -0.004051 | ... | -0.022312 | 0.019100 | -0.001079 | 0.015589 | -0.001478 | -0.000071 | 0.001222 | 0.001915 | 0.001781 | 0.003986 |
| SK_ID_CURR | -0.000321 | 1.000000 | 0.000577 | 0.000280 | 0.000195 | -0.000063 | 0.000369 | 0.002842 | 0.000098 | 0.001158 | ... | -0.016757 | -0.000637 | 0.001265 | 0.000031 | -0.001329 | -0.000757 | 0.000252 | -0.000318 | -0.000020 | 0.000876 |
| AMT_ANNUITY | 0.011459 | 0.000577 | 1.000000 | 0.808872 | 0.816429 | 0.267694 | 0.820895 | -0.036201 | 0.020639 | -0.103878 | ... | -0.202335 | 0.279051 | -0.015027 | 0.394535 | 0.052839 | -0.053295 | -0.068877 | 0.082659 | 0.068022 | 0.283080 |
| AMT_APPLICATION | 0.003302 | 0.000280 | 0.808872 | 1.000000 | 0.975824 | 0.482776 | 0.999884 | -0.014415 | 0.004310 | -0.072479 | ... | -0.199733 | 0.133660 | -0.007649 | 0.680630 | 0.074544 | -0.049532 | -0.084905 | 0.172627 | 0.148618 | 0.259219 |
| AMT_CREDIT | 0.003659 | 0.000195 | 0.816429 | 0.975824 | 1.000000 | 0.301284 | 0.993087 | -0.021039 | -0.025179 | -0.188128 | ... | -0.205158 | 0.133763 | -0.009567 | 0.674278 | -0.036813 | 0.002881 | 0.044031 | 0.224829 | 0.214320 | 0.263932 |
| AMT_DOWN_PAYMENT | -0.001313 | -0.000063 | 0.267694 | 0.482776 | 0.301284 | 1.000000 | 0.482776 | 0.016776 | 0.001597 | 0.473935 | ... | -0.115343 | -0.024536 | 0.003533 | 0.031659 | -0.001773 | -0.013586 | -0.000869 | -0.031425 | -0.030702 | -0.042585 |
| AMT_GOODS_PRICE | 0.015293 | 0.000369 | 0.820895 | 0.999884 | 0.993087 | 0.482776 | 1.000000 | -0.045267 | -0.017100 | -0.072479 | ... | -0.199733 | 0.290422 | -0.015842 | 0.672129 | -0.024445 | -0.021062 | 0.016883 | 0.211696 | 0.209296 | 0.243400 |
| HOUR_APPR_PROCESS_START | -0.002652 | 0.002842 | -0.036201 | -0.014415 | -0.021039 | 0.016776 | -0.045267 | 1.000000 | 0.005789 | 0.025930 | ... | -0.045720 | -0.039962 | 0.015671 | -0.055511 | 0.014321 | -0.002797 | -0.016567 | -0.018018 | -0.018254 | -0.117318 |
| NFLAG_LAST_APPL_IN_DAY | -0.002828 | 0.000098 | 0.020639 | 0.004310 | -0.025179 | 0.001597 | -0.017100 | 0.005789 | 1.000000 | 0.004554 | ... | 0.024640 | 0.016555 | 0.000912 | 0.063347 | -0.000409 | -0.002288 | -0.001981 | -0.002277 | -0.000744 | -0.007124 |
| RATE_DOWN_PAYMENT | -0.004051 | 0.001158 | -0.103878 | -0.072479 | -0.188128 | 0.473935 | -0.072479 | 0.025930 | 0.004554 | 1.000000 | ... | -0.106143 | -0.208742 | -0.006489 | -0.278875 | -0.007969 | -0.039178 | -0.010934 | -0.147562 | -0.145461 | -0.021633 |
| RATE_INTEREST_PRIMARY | 0.012969 | 0.033197 | 0.141823 | 0.110001 | 0.125106 | 0.016323 | 0.110001 | -0.027172 | 0.009604 | -0.103373 | ... | -0.001937 | 0.014037 | 0.159182 | -0.019030 | NaN | -0.017171 | -0.000933 | -0.010677 | -0.011099 | 0.311938 |
| RATE_INTEREST_PRIVILEGED | -0.022312 | -0.016757 | -0.202335 | -0.199733 | -0.205158 | -0.115343 | -0.199733 | -0.045720 | 0.024640 | -0.106143 | ... | 1.000000 | 0.631940 | -0.066316 | -0.057150 | NaN | 0.150904 | 0.030513 | 0.372214 | 0.378671 | -0.067157 |
| DAYS_DECISION | 0.019100 | -0.000637 | 0.279051 | 0.133660 | 0.133763 | -0.024536 | 0.290422 | -0.039962 | 0.016555 | -0.208742 | ... | 0.631940 | 1.000000 | -0.018382 | 0.246453 | -0.012007 | 0.176711 | 0.089167 | 0.448549 | 0.400179 | -0.028905 |
| SELLERPLACE_AREA | -0.001079 | 0.001265 | -0.015027 | -0.007649 | -0.009567 | 0.003533 | -0.015842 | 0.015671 | 0.000912 | -0.006489 | ... | -0.066316 | -0.018382 | 1.000000 | -0.010646 | 0.007401 | -0.002166 | -0.007510 | -0.006291 | -0.006675 | -0.018280 |
| CNT_PAYMENT | 0.015589 | 0.000031 | 0.394535 | 0.680630 | 0.674278 | 0.031659 | 0.672129 | -0.055511 | 0.063347 | -0.278875 | ... | -0.057150 | 0.246453 | -0.010646 | 1.000000 | 0.309900 | -0.204907 | -0.381013 | 0.088903 | 0.055121 | 0.320520 |
| DAYS_FIRST_DRAWING | -0.001478 | -0.001329 | 0.052839 | 0.074544 | -0.036813 | -0.001773 | -0.024445 | 0.014321 | -0.000409 | -0.007969 | ... | NaN | -0.012007 | 0.007401 | 0.309900 | 1.000000 | 0.004710 | -0.803494 | -0.257466 | -0.396284 | 0.177652 |
| DAYS_FIRST_DUE | -0.000071 | -0.000757 | -0.053295 | -0.049532 | 0.002881 | -0.013586 | -0.021062 | -0.002797 | -0.002288 | -0.039178 | ... | 0.150904 | 0.176711 | -0.002166 | -0.204907 | 0.004710 | 1.000000 | 0.513949 | 0.401838 | 0.323608 | -0.119048 |
| DAYS_LAST_DUE_1ST_VERSION | 0.001222 | 0.000252 | -0.068877 | -0.084905 | 0.044031 | -0.000869 | 0.016883 | -0.016567 | -0.001981 | -0.010934 | ... | 0.030513 | 0.089167 | -0.007510 | -0.381013 | -0.803494 | 0.513949 | 1.000000 | 0.423462 | 0.493174 | -0.221947 |
| DAYS_LAST_DUE | 0.001915 | -0.000318 | 0.082659 | 0.172627 | 0.224829 | -0.031425 | 0.211696 | -0.018018 | -0.002277 | -0.147562 | ... | 0.372214 | 0.448549 | -0.006291 | 0.088903 | -0.257466 | 0.401838 | 0.423462 | 1.000000 | 0.927990 | 0.012560 |
| DAYS_TERMINATION | 0.001781 | -0.000020 | 0.068022 | 0.148618 | 0.214320 | -0.030702 | 0.209296 | -0.018254 | -0.000744 | -0.145461 | ... | 0.378671 | 0.400179 | -0.006675 | 0.055121 | -0.396284 | 0.323608 | 0.493174 | 0.927990 | 1.000000 | -0.003065 |
| NFLAG_INSURED_ON_APPROVAL | 0.003986 | 0.000876 | 0.283080 | 0.259219 | 0.263932 | -0.042585 | 0.243400 | -0.117318 | -0.007124 | -0.021633 | ... | -0.067157 | -0.028905 | -0.018280 | 0.320520 | 0.177652 | -0.119048 | -0.221947 | 0.012560 | -0.003065 | 1.000000 |
21 rows × 21 columns
percent = (datasets["previous_application"].isnull().sum()/datasets["previous_application"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["previous_application"].isna().sum().sort_values(ascending = False)
missing_previous_application_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', 'Missing Count'])
missing_previous_application_data.head(20)
plot_missing_data("previous_application",18,20)
datasets["installments_payments"].info()
datasets["installments_payments"].columns
datasets["installments_payments"].dtypes
datasets["installments_payments"].describe()
datasets["installments_payments"].describe(include='all')
datasets["installments_payments"].corr()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 1.000000 | 0.002132 | 0.000685 | -0.002095 | 0.003748 | 0.003734 | 0.002042 | 0.001887 |
| SK_ID_CURR | 0.002132 | 1.000000 | 0.000480 | -0.000548 | 0.001191 | 0.001215 | -0.000226 | -0.000124 |
| NUM_INSTALMENT_VERSION | 0.000685 | 0.000480 | 1.000000 | -0.323414 | 0.130244 | 0.128124 | 0.168109 | 0.177176 |
| NUM_INSTALMENT_NUMBER | -0.002095 | -0.000548 | -0.323414 | 1.000000 | 0.090286 | 0.094305 | -0.089640 | -0.087664 |
| DAYS_INSTALMENT | 0.003748 | 0.001191 | 0.130244 | 0.090286 | 1.000000 | 0.999491 | 0.125985 | 0.127018 |
| DAYS_ENTRY_PAYMENT | 0.003734 | 0.001215 | 0.128124 | 0.094305 | 0.999491 | 1.000000 | 0.125555 | 0.126602 |
| AMT_INSTALMENT | 0.002042 | -0.000226 | 0.168109 | -0.089640 | 0.125985 | 0.125555 | 1.000000 | 0.937191 |
| AMT_PAYMENT | 0.001887 | -0.000124 | 0.177176 | -0.087664 | 0.127018 | 0.126602 | 0.937191 | 1.000000 |
percent = (datasets["installments_payments"].isnull().sum()/datasets["installments_payments"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["installments_payments"].isna().sum().sort_values(ascending = False)
missing_installments_payments_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', 'Missing Count'])
missing_installments_payments_data.head(20)
plot_missing_data("installments_payments",18,20)
import matplotlib.pyplot as plt
%matplotlib inline
datasets["application_train"]['TARGET'].astype(int).plot.hist();
plt.figure(figsize = (5, 5))
sns.boxplot(data = datasets["application_train"], x = 'AMT_INCOME_TOTAL')
plt.xlim(0,1000000)
plt.xlabel('Income Total Amount');
plt.title('Distribution of Income Total Amount');
plt.show()
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='NAME_INCOME_TYPE',kind='count',hue="TARGET")
plt.xlabel('Income types')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Income Types')
plt.xticks(rotation=75)
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='NAME_INCOME_TYPE',kind='count',hue="TARGET", palette = ['purple'])
plt.xlabel('Income types')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Income Types')
plt.xticks(rotation=75)
(array([0, 1, 2, 3, 4, 5, 6, 7]), [Text(0, 0, 'State servant'), Text(1, 0, 'Working'), Text(2, 0, 'Commercial associate'), Text(3, 0, 'Pensioner'), Text(4, 0, 'Unemployed'), Text(5, 0, 'Student'), Text(6, 0, 'Businessman'), Text(7, 0, 'Maternity leave')])
plt.figure(figsize = (5, 5))
sns.distplot(datasets["application_train"].AMT_CREDIT)  # distplot is deprecated in newer seaborn; sns.histplot(..., kde=True) is the replacement
plt.xlabel('Amount Credit');
plt.ylabel('Density distribution');
plt.title('Amount Credit against the density');
plt.show()
Income_credit = datasets["application_train"][['AMT_INCOME_TOTAL','AMT_CREDIT','TARGET']]
Income_credit['Ratio'] = (Income_credit['AMT_INCOME_TOTAL']/Income_credit['AMT_CREDIT'])
Income_credit
| AMT_INCOME_TOTAL | AMT_CREDIT | TARGET | Ratio | |
|---|---|---|---|---|
| 0 | 202500.0 | 406597.5 | 1 | 0.498036 |
| 1 | 270000.0 | 1293502.5 | 0 | 0.208736 |
| 2 | 67500.0 | 135000.0 | 0 | 0.500000 |
| 3 | 135000.0 | 312682.5 | 0 | 0.431748 |
| 4 | 121500.0 | 513000.0 | 0 | 0.236842 |
| ... | ... | ... | ... | ... |
| 307506 | 157500.0 | 254700.0 | 0 | 0.618375 |
| 307507 | 72000.0 | 269550.0 | 0 | 0.267112 |
| 307508 | 153000.0 | 677664.0 | 0 | 0.225776 |
| 307509 | 171000.0 | 370107.0 | 1 | 0.462029 |
| 307510 | 157500.0 | 675000.0 | 0 | 0.233333 |
307511 rows × 4 columns
import numpy as np

def count_bins(df):
    """Count TARGET == 0 borrowers in ten equal-width Ratio bins covering [0, 1]."""
    count_dict = {}
    for i in range(len(df)):
        ratio = df["Ratio"].iloc[i]
        if df["TARGET"].iloc[i] == 0 and 0 <= ratio <= 1.0:
            # Bin k covers [k/10, (k+1)/10); a ratio of exactly 1.0 falls in bin 9
            bin_idx = min(int(ratio * 10), 9)
            count_dict[bin_idx] = count_dict.get(bin_idx, 0) + 1
    return count_dict
ff = count_bins(Income_credit)
ff
{2: 61411,
5: 21745,
4: 29151,
1: 65667,
3: 46588,
8: 6843,
7: 9488,
9: 5401,
6: 13881,
0: 9171}
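The same histogram can be computed in a vectorized way; a minimal sketch using pd.cut (equivalent to count_bins up to the open/closed bin boundary convention):
repaid = Income_credit[Income_credit["TARGET"] == 0]
bins = pd.cut(repaid["Ratio"], bins=np.linspace(0, 1, 11), include_lowest=True)
print(bins.value_counts().sort_index())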
ratios = list(ff.keys())
count = list(ff.values())
AMT_INCOME_TOTAL_AMT_CREDIT = [i / 10 for i in ratios]  # bin index k -> left bin edge k/10
fig = plt.figure(figsize = (20, 5))
plt.bar(AMT_INCOME_TOTAL_AMT_CREDIT, count, width=0.08)
plt.xlim(0, 1)
plt.xlabel("Income/Credit")
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers with the Income/credit Ratio for target value 0');
plt.show()
plt.figure(figsize=(5, 5))
sns.countplot(data=datasets["application_train"], x="CODE_GENDER")
plt.xlabel('Gender')
plt.ylabel('Number of Borrowers')
plt.show()
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='CODE_GENDER',kind='count',hue="TARGET");
plt.xlabel('Gender Type')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Gender')
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='CODE_GENDER',kind='count',hue="TARGET",palette = ['purple']);
plt.xlabel('Gender Type')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Gender')
plt.show()
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
years=datasets["application_train"][['TARGET','DAYS_BIRTH']]
years['YEARS_BIRTH']=years['DAYS_BIRTH']/-365
years['group']=pd.cut(years['YEARS_BIRTH'],bins=np.linspace(0,50,num=11))  # note: bins stop at 50, so clients older than 50 fall outside all groups
age_groups = years.groupby('group').mean()
age_groups
plt.figure(figsize=(10,10))
plt.bar(age_groups.index.astype(str), 100*age_groups['TARGET'])
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to repay (%)')
plt.title('Failure to repay the loan based on Age group')
plt.show()
sns.countplot(data=datasets["application_train"], x='NAME_FAMILY_STATUS', palette='Purples')
plt.title("Family Status vs Count", fontweight = 'bold', fontsize = 11)
Text(0.5, 1.0, 'Family Status vs Count')
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='NAME_FAMILY_STATUS',kind='count',hue="TARGET")
plt.xlabel('Family Status')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Family Status')
plt.xticks(rotation=75)
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='NAME_FAMILY_STATUS',kind='count',hue="TARGET", palette = ['purple'])
plt.xlabel('Family Status')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Family Status')
plt.xticks(rotation=75)
plt.show()
fig,ax = plt.subplots(figsize=(10,10))
sns.countplot(x='CNT_CHILDREN', hue = 'TARGET',data=datasets["application_train"])
plt.xlabel("Number of Children")
plt.ylabel('Numbers of borrowers')
plt.title('Number of borrowers against target based on children count');
plt.xticks(rotation=70)
plt.show()
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
fig,ax = plt.subplots(figsize=(8,8))
sns.countplot(x='TARGET', hue = 'OCCUPATION_TYPE',data=datasets["application_train"])
plt.xlabel("Loan Type")
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target based on loan types')
plt.xticks(rotation=70)
plt.show()
sns.countplot(data=datasets["application_train"], x='FLAG_OWN_CAR')
plt.title("Count of car owners in the dataset", fontweight = 'bold', fontsize = 11)
Text(0.5, 1.0, 'Count of car owners in the dataset')
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='FLAG_OWN_CAR',kind='count',hue="TARGET")
plt.xlabel('Car Ownership Status')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on car ownership')
plt.xticks(rotation=75)
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='FLAG_OWN_CAR',kind='count',hue="TARGET", palette = ['purple'])
plt.xlabel('Car Ownership Status')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on car ownership')
plt.xticks(rotation=75)
plt.show()
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
corr_app_train = correlations.reset_index().rename(columns={'index':'Attributes','TARGET':'Correlation'})
corr_app_train
plt.figure(figsize = (10, 5))
sns.barplot(x='Attributes',y='Correlation',data= corr_app_train[corr_app_train.Correlation>0])
plt.xlabel('Attributes')
plt.ylabel('Positive Correlation')
plt.title('Positive Correlated attributes with target')
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize = (30, 5))
sns.barplot(x='Attributes',y='Correlation',data= corr_app_train[corr_app_train.Correlation<=0])
plt.xlabel('Attributes')
plt.ylabel('Negative Correlation')
plt.title('Negative Correlated attributes with target')
plt.xticks(rotation=90)
plt.show()
Most Positive Correlations: FLAG_DOCUMENT_3 0.044346 REG_CITY_NOT_LIVE_CITY 0.044395 FLAG_EMP_PHONE 0.045982 REG_CITY_NOT_WORK_CITY 0.050994 DAYS_ID_PUBLISH 0.051457 DAYS_LAST_PHONE_CHANGE 0.055218 REGION_RATING_CLIENT 0.058899 REGION_RATING_CLIENT_W_CITY 0.060893 DAYS_BIRTH 0.078239 TARGET 1.000000 Name: TARGET, dtype: float64 Most Negative Correlations: EXT_SOURCE_3 -0.178919 EXT_SOURCE_2 -0.160472 EXT_SOURCE_1 -0.155317 DAYS_EMPLOYED -0.044932 FLOORSMAX_AVG -0.044003 FLOORSMAX_MEDI -0.043768 FLOORSMAX_MODE -0.043226 AMT_GOODS_PRICE -0.039645 REGION_POPULATION_RELATIVE -0.037227 ELEVATORS_AVG -0.034199 Name: TARGET, dtype: float64
from pandas.plotting import scatter_matrix
#We can take the top 10 features
top_corr_features = ["TARGET", "REGION_RATING_CLIENT","REGION_RATING_CLIENT_W_CITY","DAYS_LAST_PHONE_CHANGE",
"DAYS_BIRTH", "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_ID_PUBLISH","REG_CITY_NOT_WORK_CITY"]
# scatter_matrix(datasets["application_train"][top_corr_features], figsize=(12, 8));
df = datasets["application_train"].copy()
df2 = df[top_corr_features]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').set_precision(2)  # set_precision is removed in pandas >= 2.0; use .format(precision=2) there
| TARGET | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | DAYS_LAST_PHONE_CHANGE | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 1.00 | 0.06 | 0.06 | 0.06 | 0.08 | -0.16 | -0.16 | -0.18 | 0.05 | 0.05 |
| REGION_RATING_CLIENT | 0.06 | 1.00 | 0.95 | 0.03 | 0.01 | -0.12 | -0.29 | -0.01 | -0.01 | 0.01 |
| REGION_RATING_CLIENT_W_CITY | 0.06 | 0.95 | 1.00 | 0.03 | 0.01 | -0.12 | -0.29 | -0.01 | -0.01 | 0.03 |
| DAYS_LAST_PHONE_CHANGE | 0.06 | 0.03 | 0.03 | 1.00 | 0.08 | -0.13 | -0.20 | -0.08 | 0.09 | 0.05 |
| DAYS_BIRTH | 0.08 | 0.01 | 0.01 | 0.08 | 1.00 | -0.60 | -0.09 | -0.21 | 0.27 | 0.24 |
| EXT_SOURCE_1 | -0.16 | -0.12 | -0.12 | -0.13 | -0.60 | 1.00 | 0.21 | 0.19 | -0.13 | -0.19 |
| EXT_SOURCE_2 | -0.16 | -0.29 | -0.29 | -0.20 | -0.09 | 0.21 | 1.00 | 0.11 | -0.05 | -0.08 |
| EXT_SOURCE_3 | -0.18 | -0.01 | -0.01 | -0.08 | -0.21 | 0.19 | 0.11 | 1.00 | -0.13 | -0.08 |
| DAYS_ID_PUBLISH | 0.05 | -0.01 | -0.01 | 0.09 | 0.27 | -0.13 | -0.05 | -0.13 | 1.00 | 0.10 |
| REG_CITY_NOT_WORK_CITY | 0.05 | 0.01 | 0.03 | 0.05 | 0.24 | -0.19 | -0.08 | -0.08 | 0.10 | 1.00 |
most_corr=datasets["application_train"][["REGION_RATING_CLIENT","REGION_RATING_CLIENT_W_CITY","DAYS_LAST_PHONE_CHANGE",
"DAYS_BIRTH", "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_ID_PUBLISH","REG_CITY_NOT_WORK_CITY",'TARGET']]
most_corr_corr = most_corr.corr()
sns.set_style("dark")
sns.set_context("notebook", font_scale=2.0, rc={"lines.linewidth": 1.0})
fig, axes = plt.subplots(figsize = (20,10),sharey=True)
sns.heatmap(most_corr_corr,cmap=plt.cm.RdYlBu_r,vmin=-0.25,vmax=0.6,annot=True)
plt.title('Correlation Heatmap for features with highest correlations with target variables')
Text(0.5, 1.0, 'Correlation Heatmap for features with highest correlations with target variables')
# Import necessary libraries for data preprocessing
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from pandas.plotting import scatter_matrix
# Import necessary libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Import necessary libraries for logistic regression
from sklearn.linear_model import LogisticRegression
# Import necessary libraries for model selection and evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import auc, accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# Import necessary libraries for building and training neural network
import time
from datetime import datetime
import json
import pickle
import copy
import torch
import tensorflow as tf
import torch.nn as nn
import torch.nn.functional as func
from torch.nn.functional import binary_cross_entropy
import torch.optim as optim
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras import layers
from tensorflow.keras.callbacks import LearningRateScheduler
WARNING:tensorflow:From c:\Users\jgdsh\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))
tf.Tensor(339.9657, shape=(), dtype=float32)
# Import necessary libraries
import time
from datetime import datetime
import json
import pickle
import copy
import warnings
import numpy as np
import pandas as pd
import torch
import tensorflow as tf
import torch.nn as nn
import torch.nn.functional as func
from torch.nn.functional import binary_cross_entropy
import torch.optim as optim
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import auc, accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras import layers
from tensorflow.keras.callbacks import LearningRateScheduler
# Ignore warnings
warnings.filterwarnings('ignore')
# The imports above cover data preprocessing, model evaluation, and the neural network models used below.
DATA_DIR
'../Data/home-credit-default-risk'
datasets['application_train'].shape
(307511, 122)
# Access the 'application_train' dataset from the 'datasets' container
application_train = datasets['application_train']
# Select the minority class instances (TARGET = 1) from the training dataset
minority_application_train = application_train[application_train['TARGET']==1]
# Append a randomly sampled subset of majority class instances (TARGET = 0) to the minority class instances
# Note: DataFrame.append is removed in pandas >= 2.0; see the pd.concat sketch below
undersampled_application_train = minority_application_train.append(
    application_train[application_train['TARGET']==0].reset_index(drop=True).sample(n = 75000)
)
# Assign the undersampled training dataset to a new key in the 'datasets' dictionary
datasets["undersampled_application_train"] = undersampled_application_train
# Count the number of instances in each class
class_distribution = undersampled_application_train['TARGET'].value_counts()
# Print the class distribution
print("Class distribution in the undersampled training dataset:")
print(class_distribution)
Class distribution in the undersampled training dataset: 0 75000 1 24825 Name: TARGET, dtype: int64
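On pandas >= 2.0, where DataFrame.append no longer exists, the same undersampling step can be written with pd.concat; a minimal sketch:
# Equivalent of the .append call above for pandas >= 2.0
undersampled_application_train = pd.concat([
    minority_application_train,
    application_train[application_train['TARGET'] == 0].reset_index(drop=True).sample(n=75000),
])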
# Assuming this is a dictionary where you store your datasets
# Filtering rows with TARGET == 1 and creating a new DataFrame
datasets["undersampled_application_train_2"] = datasets["application_train"][datasets["application_train"].TARGET == 1].copy()
datasets["undersampled_application_train_2"]['weight'] = 1
# Undersampling Cash loans
num_default_cashloans = len(datasets["undersampled_application_train_2"][(datasets["undersampled_application_train_2"].NAME_CONTRACT_TYPE == 'Cash loans') & (datasets["undersampled_application_train_2"].TARGET == 1)])
df_sample_cash = datasets["application_train"][(datasets["application_train"].NAME_CONTRACT_TYPE == 'Cash loans') & (datasets["application_train"].TARGET == 0)].sample(n=num_default_cashloans, random_state=42)
df_sample_cash['weight'] = 1
# Undersampling Revolving loans
num_default_revolvingloans = len(datasets["undersampled_application_train_2"][(datasets["undersampled_application_train_2"].NAME_CONTRACT_TYPE == 'Revolving loans') & (datasets["undersampled_application_train_2"].TARGET == 1)])
df_sample_revolving = datasets["application_train"][(datasets["application_train"].NAME_CONTRACT_TYPE == 'Revolving loans') & (datasets["application_train"].TARGET == 0)].sample(n=num_default_revolvingloans, random_state=42)
df_sample_revolving['weight'] = 1
# Combining undersampled cash loans and revolving loans with the initial DataFrame
datasets["undersampled_application_train_2"] = pd.concat([datasets["undersampled_application_train_2"], df_sample_cash, df_sample_revolving])
# Check the distribution of the TARGET variable
print(datasets["undersampled_application_train_2"].TARGET.value_counts())
1 24825 0 24825 Name: TARGET, dtype: int64
# Create aggregate features (via pipeline)
class FeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None, agg_needed=["mean"]):  # no *args or **kwargs
        self.features = features
        self.agg_needed = agg_needed
        self.agg_op_features = {}
        for f in features:
            self.agg_op_features[f] = self.agg_needed[:]

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Aggregate each requested feature per client, then flatten the
        # resulting MultiIndex columns into "<feature>_<agg>" names
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        df_result = pd.DataFrame()
        for x1, x2 in result.columns:
            new_col = x1 + "_" + x2
            df_result[new_col] = result[x1][x2]
        df_result = df_result.reset_index(level=["SK_ID_CURR"])
        return df_result
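As a sanity check, a minimal sketch of what this transformer produces, on a hypothetical two-client frame (toy values, not HCDR data):
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [100.0, 300.0, 50.0],
})
agg = FeaturesAggregater(features=["AMT_CREDIT"], agg_needed=["mean", "sum"])
print(agg.fit_transform(toy))
#    SK_ID_CURR  AMT_CREDIT_mean  AMT_CREDIT_sum
# 0           1            200.0           400.0
# 1           2             50.0            50.0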
# Access the 'previous_application' dataset from the 'datasets' container and assign it to a variable named 'previous_application_data'
previous_application_data = datasets["previous_application"]
# Apply the 'isna()' method on the 'previous_application_data' DataFrame to detect missing or null values,
# and then apply the 'sum()' method to count the number of missing values in each column of the DataFrame.
missing_values_count_per_column = previous_application_data.isna().sum()
missing_values_count_per_column
SK_ID_PREV 0 SK_ID_CURR 0 NAME_CONTRACT_TYPE 0 AMT_ANNUITY 372235 AMT_APPLICATION 0 AMT_CREDIT 1 AMT_DOWN_PAYMENT 895844 AMT_GOODS_PRICE 385515 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 FLAG_LAST_APPL_PER_CONTRACT 0 NFLAG_LAST_APPL_IN_DAY 0 RATE_DOWN_PAYMENT 895844 RATE_INTEREST_PRIMARY 1664263 RATE_INTEREST_PRIVILEGED 1664263 NAME_CASH_LOAN_PURPOSE 0 NAME_CONTRACT_STATUS 0 DAYS_DECISION 0 NAME_PAYMENT_TYPE 0 CODE_REJECT_REASON 0 NAME_TYPE_SUITE 820405 NAME_CLIENT_TYPE 0 NAME_GOODS_CATEGORY 0 NAME_PORTFOLIO 0 NAME_PRODUCT_TYPE 0 CHANNEL_TYPE 0 SELLERPLACE_AREA 0 NAME_SELLER_INDUSTRY 0 CNT_PAYMENT 372230 NAME_YIELD_GROUP 0 PRODUCT_COMBINATION 346 DAYS_FIRST_DRAWING 673065 DAYS_FIRST_DUE 673065 DAYS_LAST_DUE_1ST_VERSION 673065 DAYS_LAST_DUE 673065 DAYS_TERMINATION 673065 NFLAG_INSURED_ON_APPROVAL 673065 dtype: int64
previous_feature = ["AMT_APPLICATION", "AMT_CREDIT", "AMT_ANNUITY", "approved_credit_ratio", "AMT_ANNUITY_credit_ratio", "Interest_ratio", "LTV_ratio", "SK_ID_PREV", "approved"]
agg_needed = ["min", "max", "mean", "count", "sum"]
agg_needed = ["min", "max", "mean", "count", "sum"]
def previous_feature_aggregation(df, feature, agg_needed):
    # application amount over credit granted
    df['approved_credit_ratio'] = (df['AMT_APPLICATION']/df['AMT_CREDIT']).replace(np.inf, 0)
    # installment over credit approved ratio
    df['AMT_ANNUITY_credit_ratio'] = (df['AMT_ANNUITY']/df['AMT_CREDIT']).replace(np.inf, 0)
    # total interest payment over credit ratio (note: currently the same formula as AMT_ANNUITY_credit_ratio)
    df['Interest_ratio'] = (df['AMT_ANNUITY']/df['AMT_CREDIT']).replace(np.inf, 0)
    # loan cover ratio
    df['LTV_ratio'] = (df['AMT_CREDIT']/df['AMT_GOODS_PRICE']).replace(np.inf, 0)
    df['approved'] = np.where(df.AMT_CREDIT > 0, 1, 0)
    test_pipeline = make_pipeline(FeaturesAggregater(feature, agg_needed))
    return test_pipeline.fit_transform(df)
datasets['previous_application_agg'] = previous_feature_aggregation(datasets["previous_application"], previous_feature, agg_needed)
datasets["previous_application_agg"].isna().sum()
SK_ID_CURR 0 AMT_APPLICATION_min 0 dtype: int64
datasets["installments_payments"].isna().sum()
SK_ID_PREV 0 SK_ID_CURR 0 NUM_INSTALMENT_VERSION 0 NUM_INSTALMENT_NUMBER 0 DAYS_INSTALMENT 0 DAYS_ENTRY_PAYMENT 2905 AMT_INSTALMENT 0 AMT_PAYMENT 2905 dtype: int64
payments_features = ["DAYS_INSTALMENT_DIFF", "AMT_PATMENT_PCT"]
agg_needed = ["mean"]
def payments_feature_aggregation(df, feature, agg_needed):
    # days between scheduled installment and actual payment
    df['DAYS_INSTALMENT_DIFF'] = df['DAYS_INSTALMENT'] - df['DAYS_ENTRY_PAYMENT']
    # fraction of the scheduled installment that was actually paid
    df['AMT_PATMENT_PCT'] = [x/y if (y != 0) & pd.notnull(y) else np.nan for x, y in zip(df.AMT_PAYMENT, df.AMT_INSTALMENT)]
    test_pipeline = make_pipeline(FeaturesAggregater(feature, agg_needed))
    return test_pipeline.fit_transform(df)
datasets['installments_payments_agg'] = payments_feature_aggregation(datasets["installments_payments"], payments_features, agg_needed)
datasets["installments_payments_agg"].isna().sum()
SK_ID_CURR 0 DAYS_INSTALMENT_DIFF_mean 9 dtype: int64
datasets["credit_card_balance"].isna().sum()
SK_ID_PREV 0 SK_ID_CURR 0 MONTHS_BALANCE 0 AMT_BALANCE 0 AMT_CREDIT_LIMIT_ACTUAL 0 AMT_DRAWINGS_ATM_CURRENT 749816 AMT_DRAWINGS_CURRENT 0 AMT_DRAWINGS_OTHER_CURRENT 749816 AMT_DRAWINGS_POS_CURRENT 749816 AMT_INST_MIN_REGULARITY 305236 AMT_PAYMENT_CURRENT 767988 AMT_PAYMENT_TOTAL_CURRENT 0 AMT_RECEIVABLE_PRINCIPAL 0 AMT_RECIVABLE 0 AMT_TOTAL_RECEIVABLE 0 CNT_DRAWINGS_ATM_CURRENT 749816 CNT_DRAWINGS_CURRENT 0 CNT_DRAWINGS_OTHER_CURRENT 749816 CNT_DRAWINGS_POS_CURRENT 749816 CNT_INSTALMENT_MATURE_CUM 305236 NAME_CONTRACT_STATUS 0 SK_DPD 0 SK_DPD_DEF 0 dtype: int64
credit_features = [
"AMT_BALANCE",
"AMT_DRAWINGS_PCT",
"AMT_DRAWINGS_ATM_PCT",
"AMT_DRAWINGS_OTHER_PCT",
"AMT_DRAWINGS_POS_PCT",
"AMT_PRINCIPAL_RECEIVABLE_PCT",
"CNT_DRAWINGS_ATM_CURRENT",
"CNT_DRAWINGS_CURRENT",
"CNT_DRAWINGS_OTHER_CURRENT",
"CNT_DRAWINGS_POS_CURRENT",
"SK_DPD",
"SK_DPD_DEF",
]
agg_needed = ["mean"]
def calculate_pct(x, y):
    return x / y if (y != 0) & pd.notnull(y) else np.nan

def credit_feature_aggregation(df, feature, agg_needed):
    # (source column, derived percentage-of-credit-limit column)
    pct_columns = [
        ("AMT_DRAWINGS_CURRENT", "AMT_DRAWINGS_PCT"),
        ("AMT_DRAWINGS_ATM_CURRENT", "AMT_DRAWINGS_ATM_PCT"),
        ("AMT_DRAWINGS_OTHER_CURRENT", "AMT_DRAWINGS_OTHER_PCT"),
        ("AMT_DRAWINGS_POS_CURRENT", "AMT_DRAWINGS_POS_PCT"),
        ("AMT_RECEIVABLE_PRINCIPAL", "AMT_PRINCIPAL_RECEIVABLE_PCT"),
    ]
    for col_x, col_pct in pct_columns:
        df[col_pct] = [calculate_pct(x, y) for x, y in zip(df[col_x], df["AMT_CREDIT_LIMIT_ACTUAL"])]
    pipeline = make_pipeline(FeaturesAggregater(feature, agg_needed))
    return pipeline.fit_transform(df)
datasets["credit_card_balance_agg"] = credit_feature_aggregation(
datasets["credit_card_balance"], credit_features, agg_needed
)
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance', 'undersampled_application_train', 'undersampled_application_train_2', 'previous_application_agg', 'installments_payments_agg', 'credit_card_balance_agg'])
# Load the train dataset
train_data = datasets["application_train"]
# Compute the distribution of the target variable
target_counts = train_data['TARGET'].value_counts()
# Display the target distribution
print("Target variable distribution:\n")
print(target_counts)
print("\n")
# Compute the percentage of positive and negative examples in the dataset
positive_count = target_counts[1]
negative_count = target_counts[0]
total_count = positive_count + negative_count
positive_percentage = (positive_count / total_count) * 100
negative_percentage = (negative_count / total_count) * 100
# Display the percentages of positive and negative examples
print(f"Percentage of positive examples: {positive_percentage:.2f}%")
print(f"Percentage of negative examples: {negative_percentage:.2f}%")
Target variable distribution: 0 282686 1 24825 Name: TARGET, dtype: int64 Percentage of positive examples: 8.07% Percentage of negative examples: 91.93%
train_dataset= datasets["undersampled_application_train"] #primary dataset
merge_all_data = True
# merge primary table and secondary tables using features based on metadata and aggregate stats
if merge_all_data:
    # 1. Join/Merge in prevApps Data
    train_dataset = train_dataset.merge(datasets["previous_application_agg"], how='left', on='SK_ID_CURR')
    # 2. Join/Merge in Installments Payments Data
    train_dataset = train_dataset.merge(datasets["installments_payments_agg"], how='left', on="SK_ID_CURR")
    # 3. Join/Merge in Credit Card Balance Data
    train_dataset = train_dataset.merge(datasets["credit_card_balance_agg"], how='left', on="SK_ID_CURR")
datasets["undersampled_application_train_4"] = train_dataset
train_dataset.shape
(99825, 125)
train_dataset = datasets["undersampled_application_train_2"]
train_dataset = train_dataset.merge(datasets["previous_application_agg"], how='left', on='SK_ID_CURR')
train_dataset = train_dataset.merge(datasets["installments_payments_agg"], how='left', on="SK_ID_CURR")
train_dataset = train_dataset.merge(datasets["credit_card_balance_agg"], how='left', on="SK_ID_CURR")
train_dataset = train_dataset.drop(columns = 'weight')
datasets["undersampled_application_train_4_2"] = train_dataset
train_dataset.shape
(49650, 125)
train_dataset.to_csv('train_dataset.csv', index=False)
X_kaggle_test= datasets["application_test"]
# merge primary table and secondary tables using features based on metadata and aggregate stats
if merge_all_data:
    # 1. Join/Merge in prevApps Data
    X_kaggle_test = X_kaggle_test.merge(datasets["previous_application_agg"], how='left', on='SK_ID_CURR')
    # 2. Join/Merge in Installments Payments Data
    X_kaggle_test = X_kaggle_test.merge(datasets["installments_payments_agg"], how='left', on="SK_ID_CURR")
    # 3. Join/Merge in Credit Card Balance Data
    X_kaggle_test = X_kaggle_test.merge(datasets["credit_card_balance_agg"], how='left', on="SK_ID_CURR")
X_kaggle_test.shape
(48744, 124)
X_kaggle_test.to_csv('X_kaggle_test_phase4.csv', index=False)
In the previous phase, feature engineering was performed to prepare the dataset for analysis. That dataset, along with the feature dictionary obtained from hyperparameter tuning of the XGBoost model, is reused in the current phase. The train_dataset.csv file used here is a derivative of the Phase 3 training dataset: a CSV that combines undersampled data from the application train, previous application, installment payments, and credit card balance tables, plus the engineered features created in the feature engineering section of Phase 3.
train_dataset = pd.read_csv("train_dataset.csv")
train_dataset.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | AMT_APPLICATION_min | DAYS_INSTALMENT_DIFF_mean | AMT_BALANCE_mean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 179055.0 | 20.421053 | NaN |
| 1 | 100031 | 1 | Cash loans | F | N | Y | 0 | 112500.0 | 979992.0 | 27076.5 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | NaN | NaN | NaN |
| 2 | 100047 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 1193580.0 | 35028.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 | 0.0 | 4.100000 | 0.000000 |
| 3 | 100049 | 1 | Cash loans | F | N | N | 0 | 135000.0 | 288873.0 | 16258.5 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 6.068966 | 48183.296538 |
| 4 | 100096 | 1 | Cash loans | F | N | Y | 0 | 81000.0 | 252000.0 | 14593.5 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN |
5 rows × 125 columns
train_dataset.shape
(49650, 125)
#train_dataset = pd.read_csv("train_dataset.csv")
X_kaggle_test = pd.read_csv("X_kaggle_test_phase4.csv")
X_kaggle_test.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | AMT_APPLICATION_min | DAYS_INSTALMENT_DIFF_mean | AMT_BALANCE_mean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24835.5 | 7.285714 | NaN |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 23.555556 | NaN |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 5.180645 | 18159.919219 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.000000 | 8085.058163 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 80955.0 | 12.250000 | NaN |
5 rows × 124 columns
# class to select numerical or categorical columns
class DataFrameCreation(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

def pct(x):
    return round(100 * x, 3)
def get_pipeline(dataset, num_cols=None):
    numerical_features = []
    categorical_features = []
    for x in dataset:
        if dataset[x].dtype == np.float64 or dataset[x].dtype == np.int64:
            numerical_features.append(x)
        else:
            categorical_features.append(x)
    numerical_features.remove('TARGET')
    numerical_features.remove('SK_ID_CURR')
    categorical_pipeline = Pipeline([
        ('selector', DataFrameCreation(categorical_features)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))  # on scikit-learn >= 1.2 this keyword is sparse_output
    ])
    # If columns are provided, only those columns are passed to the model
    if num_cols is None:
        final_numerical_features = numerical_features
    else:
        final_numerical_features = num_cols
    numerical_pipeline = Pipeline([
        ('selector', DataFrameCreation(final_numerical_features)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])
    data_pipeline = FeatureUnion(transformer_list=[
        ("numerical_pipeline", numerical_pipeline),
        ("categorical_pipeline", categorical_pipeline),
    ])
    selected_features = final_numerical_features + categorical_features + ["SK_ID_CURR"]
    tot_features = f"{len(selected_features)}: Num:{len(final_numerical_features)}, Cat:{len(categorical_features)}"
    print('Total Features:', tot_features)
    return data_pipeline, selected_features
data_pipeline, selected_features = get_pipeline(train_dataset)
Total Features: 124: Num:107, Cat:16
y_train = train_dataset['TARGET']
X_train = train_dataset[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X test shape: {X_test.shape}")
X train shape: (39720, 124) X test shape: (9930, 124)
torch.cuda.is_available()
True
print(torch.version.cuda)
11.8
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Handling missing values and standardizing the data
X_train_std = data_pipeline.fit_transform(X_train)
X_test_std = data_pipeline.transform(X_test)
X_kaggle_test_std = data_pipeline.transform(X_kaggle_test)
# Converting numpy arrays into float tensors using gpu device
X_train_tensor = torch.FloatTensor(X_train_std).to(device)
X_test_tensor = torch.FloatTensor(X_test_std).to(device)
X_kaggle_test_tensor = torch.FloatTensor(X_kaggle_test_std).to(device)
# Converting numpy arrays to float tensors and reshaping y_train and y_test
y_train_tensor = torch.FloatTensor(y_train.to_numpy()).to(device)
y_train_tensor = y_train_tensor.reshape(-1, 1)
y_test_tensor = torch.FloatTensor(y_test.to_numpy()).to(device)
y_test_tensor = y_test_tensor.reshape(-1, 1)
X_train_tensor.shape, X_test_tensor.shape, X_kaggle_test_tensor.shape
(torch.Size([39720, 245]), torch.Size([9930, 245]), torch.Size([48744, 245]))
# Loading features and importances from phase3
with open("features_dict_XG.pickle", 'rb') as handle:
features_dict = pickle.load(handle)
# selecting features with importance values > 0
features = features_dict['features']
importances = features_dict['importances']
new_indices = [idx for idx, x in enumerate(importances) if x > 0]
new_importances = [x for idx, x in enumerate(importances) if x > 0]
new_features = [features[i] for i in new_indices]
# creating pipeline by joining numerical and categorical pipelines
num_attribs = new_features
data_pipeline, selected_features = get_pipeline(train_dataset, num_attribs)
# splitting the dataset into train and test datasets with selected features
y_train_sel, X_train_sel = train_dataset['TARGET'], train_dataset[selected_features]
X_kaggle_test_sel = X_kaggle_test[selected_features]
X_train_sel, X_test_sel, y_train_sel, y_test_sel = train_test_split(X_train_sel, y_train_sel, test_size=0.2, random_state=42)
# Handling missing values and standardizing the data using pipeline
X_train_sel_std, X_test_sel_std, X_kaggle_test_sel_std = data_pipeline.fit_transform(X_train_sel), data_pipeline.transform(X_test_sel), data_pipeline.transform(X_kaggle_test_sel)
# Generating float tensors from numpy arrays using GPU device
X_train_sel_tensor, X_test_sel_tensor, X_kaggle_test_sel_tensor = map(lambda x: torch.FloatTensor(x).to(device), (X_train_sel_std, X_test_sel_std, X_kaggle_test_sel_std))
y_train_sel_tensor, y_test_sel_tensor = map(lambda x: torch.FloatTensor(x.to_numpy()).reshape(-1, 1).to(device), (y_train_sel, y_test_sel))
# Print the shapes of tensors
print(f"X train selected shape: {X_train_sel_tensor.shape}")
print(f"X test selected shape: {X_test_sel_tensor.shape}")
Total Features: 113: Num:96, Cat:16 X train selected shape: torch.Size([39720, 234]) X test selected shape: torch.Size([9930, 234])
%matplotlib inline
writer = SummaryWriter()
The performance of a classification model is assessed by calculating the area under the ROC curve (AUC), which quantifies the model's ability to distinguish between positive and negative instances, i.e., the agreement between the predicted probability and the observed target. The scikit-learn roc_auc_score function computes the AUC, condensing the information provided by the ROC curve into a single numerical value.
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)
0.75
Accuracy represents the percentage of correctly classified data points among all data points.
$$ \operatorname{Accuracy} = \frac{TN+TP}{TN+FP+TP+FN} $$
Precision indicates the proportion of instances correctly identified as positive among all instances predicted as positive. It is calculated by dividing the number of true positives by the total number of predicted positives.
$$ \operatorname{Precision} = \frac{TP}{TP+FP} $$
Recall indicates the proportion of actual positive cases that are correctly identified as positive. It is interchangeably referred to as the True Positive Rate (TPR).
$$ \operatorname{Recall} = \frac{TP}{TP+FN} $$
The F1-score is the harmonic mean of precision and recall, considering both false positives and false negatives. It serves as a valuable metric for assessing model performance on imbalanced datasets.
$$ \operatorname{F1Score} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} $$
The Area Under the Curve (AUC) metric serves as a performance evaluation tool for binary classification models by quantifying the area enclosed by the Receiver Operating Characteristic (ROC) curve. It provides a single numerical value encapsulating overall model performance across all potential classification thresholds, is resilient to class imbalance, and does not depend on a specific classification threshold. Higher AUC values signify superior model performance.
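As a quick worked example (toy labels, not HCDR predictions), these four metrics can be computed with scikit-learn:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels giving TP=2, TN=2, FP=1, FN=1
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

print(accuracy_score(y_true, y_pred))   # (2+2)/6 ~ 0.667
print(precision_score(y_true, y_pred))  # 2/(2+1) ~ 0.667
print(recall_score(y_true, y_pred))     # 2/(2+1) ~ 0.667
print(f1_score(y_true, y_pred))         # 2*(2/3)*(2/3)/(2/3+2/3) ~ 0.667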
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name", "learning_rate", "epochs",
                                   "Train Time (sec)",
                                   "Test Time (sec)",
                                   "Train Acc",
                                   "Test Acc",
                                   "Train AUC",
                                   "Test AUC",
                                   "Train F1",
                                   "Test F1"])
The binary cross-entropy loss function will be utilized by this MLP class.
$$ CXE = -\frac{1}{m}\sum\limits_{i=1}^{m} \left( y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i) \right) $$
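A minimal sketch (with hypothetical probabilities) showing that the formula above matches PyTorch's built-in binary_cross_entropy:
import torch
from torch.nn.functional import binary_cross_entropy

p = torch.tensor([0.9, 0.2, 0.7, 0.4])  # hypothetical predicted probabilities
y = torch.tensor([1.0, 0.0, 1.0, 0.0])  # true labels

# Direct implementation of the CXE formula above
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = binary_cross_entropy(p, y)
print(manual.item(), builtin.item())  # both ~ 0.299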
from sklearn.metrics import f1_score

def get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train, y_train, X_test, y_test):
    def test_metrics(X, y, model):
        X = X.to(device)  # Move the input tensor to the GPU
        model.eval()
        with torch.no_grad():
            y_prob = model(X)
        y_pred = y_prob.cpu().detach().numpy().round()
        # Note: AUC on rounded 0/1 predictions equals balanced accuracy,
        # not the probability-based ROC AUC
        roc_auc = roc_auc_score(y, y_pred)
        accuracy = accuracy_score(y, y_pred)
        f1 = f1_score(y, y_pred)
        return accuracy, roc_auc, f1

    # Getting the results
    accuracy_train, roc_auc_train, f1_train = test_metrics(X_train, y_train, model)
    accuracy_test, roc_auc_test, f1_test = test_metrics(X_test, y_test, model)
    expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
        [learning_rate, epochs, train_time, test_time,
         accuracy_train, accuracy_test, roc_auc_train, roc_auc_test, f1_train, f1_test],
        4))
    return expLog
def train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor, model, optimizer, writer, learning_rate=0.01, epochs=1000, device='cuda'):
    # Move tensors to the GPU
    X_train_tensor = X_train_tensor.to(device)
    y_train_tensor = y_train_tensor.to(device)
    X_test_tensor = X_test_tensor.to(device)
    y_test_tensor = y_test_tensor.to(device)
    # Model to be trained on GPU
    model = model.to(device)
    print('Model Architecture:')
    print(model, '\n')
    print('Training the model:')
    model.train()
    for epoch_id in range(epochs):
        y_prob = model(X_train_tensor)
        loss = binary_cross_entropy(y_prob, y_train_tensor)
        writer.add_scalar("Train Loss", loss, epoch_id + 1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch_id % 50 == 49:
            print(f"Epoch {epoch_id + 1}:")
            show_metrics(y_train_tensor, y_prob, epoch_id + 1, writer)
    writer.flush()
    writer.close()
    print()
    # Testing the model
    model.eval()
    with torch.no_grad():
        y_test_pred_prob = model(X_test_tensor)
    print('Test data:')
    show_metrics(y_test_tensor, y_test_pred_prob, writer=None)

def show_metrics(y_true, y_prob, idx=0, writer=None):
    # Hard 0/1 predictions from the predicted probabilities
    y_pred = y_prob.cpu().detach().numpy().round()
    # Move tensors to the CPU
    y_true = y_true.cpu()
    # Calculating metrics from actual and predicted values.
    # Note: ROC_AUC is computed on rounded 0/1 predictions rather than on
    # probabilities, so it equals balanced accuracy here
    roc_auc = roc_auc_score(y_true.numpy(), y_pred)
    accuracy = accuracy_score(y_true.numpy(), y_pred)
    f1 = f1_score(y_true.numpy(), y_pred)
    if writer:
        # Adding info to tensorboard
        writer.add_scalar("Train ROC_AUC", roc_auc, idx)
        writer.add_scalar("Train Accuracy", accuracy, idx)
        writer.add_scalar("Train F1", f1, idx)
    # Printing accuracy, ROC_AUC, and F1 for reference
    print(f'Accuracy : {round(accuracy, 4)} ; ROC_AUC : {round(roc_auc, 4)} ; F1 : {round(f1, 4)}')
Using the HCDR data, we will replicate the preprocessing and feature engineering steps implemented in phase 3. Subsequently, we will adopt the same feature selection method and feature dictionary as in phase 3. Next, we will construct three MLP models with varying degrees of complexity and depth. Following this, we will identify the model that demonstrates the best performance and conduct hyperparameter tuning. We will then consolidate and analyze the results to select the model that yields the optimal F1 and AUC scores. Finally, we will submit the chosen model for evaluation on Kaggle.
The neural network model, designed using PyTorch, a widely used deep learning framework, comprises a single layer with a linear transformation and a sigmoid activation function. The model's input and output dimensions are tailored to the training data's structure. The input dimension aligns with the training data's number of columns, while the output dimension is set to 1, suitable for a binary classification task.
import torch
import torch.nn as nn
# Define input and output dimensions
dim_input = X_train_tensor.shape[1]
dim_output = 1
# Define the model architecture
model1 = torch.nn.Sequential(
    torch.nn.Linear(dim_input, dim_output),
    nn.Sigmoid()
)
from torchsummary import summary
# Print summary of model architecture
summary(model1, input_size=(X_train_tensor.shape[1],), device='cpu')
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Linear-1 [-1, 1] 246
Sigmoid-2 [-1, 1] 0
================================================================
Total params: 246
Trainable params: 246
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------
import time
import numpy as np
from torch.optim import Adam
model = model1
learning_rate = 0.01
epochs = 1000
optimizer = Adam(model.parameters(), learning_rate)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32)
y_test = y_test_tensor  # reuse the tensor for the metric helpers below
# Training the model
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second pass (note: this re-runs train_and_test, further training the same model)
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
Model Architecture: Sequential( (0): Linear(in_features=245, out_features=1, bias=True) (1): Sigmoid() ) Training the model: Epoch 50: Accuracy : 0.686 ; ROC_AUC : 0.686 ; F1 : 0.6849 Epoch 100: Accuracy : 0.6879 ; ROC_AUC : 0.6879 ; F1 : 0.6869 Epoch 150: Accuracy : 0.6891 ; ROC_AUC : 0.6891 ; F1 : 0.6882 Epoch 200: Accuracy : 0.6898 ; ROC_AUC : 0.6898 ; F1 : 0.6889 Epoch 250: Accuracy : 0.6897 ; ROC_AUC : 0.6897 ; F1 : 0.6889 Epoch 300: Accuracy : 0.69 ; ROC_AUC : 0.69 ; F1 : 0.6892 Epoch 350: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6893 Epoch 400: Accuracy : 0.69 ; ROC_AUC : 0.69 ; F1 : 0.6891 Epoch 450: Accuracy : 0.69 ; ROC_AUC : 0.69 ; F1 : 0.6891 Epoch 500: Accuracy : 0.6903 ; ROC_AUC : 0.6903 ; F1 : 0.6895 Epoch 550: Accuracy : 0.6905 ; ROC_AUC : 0.6905 ; F1 : 0.6897 Epoch 600: Accuracy : 0.6908 ; ROC_AUC : 0.6908 ; F1 : 0.6899 Epoch 650: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.6899 Epoch 700: Accuracy : 0.6905 ; ROC_AUC : 0.6905 ; F1 : 0.6897 Epoch 750: Accuracy : 0.6906 ; ROC_AUC : 0.6906 ; F1 : 0.6899 Epoch 800: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.69 Epoch 850: Accuracy : 0.6905 ; ROC_AUC : 0.6905 ; F1 : 0.6898 Epoch 900: Accuracy : 0.6906 ; ROC_AUC : 0.6906 ; F1 : 0.6898 Epoch 950: Accuracy : 0.6906 ; ROC_AUC : 0.6906 ; F1 : 0.6899 Epoch 1000: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.6899 Test data: Accuracy : 0.6815 ; ROC_AUC : 0.6815 ; F1 : 0.6818 Model Architecture: Sequential( (0): Linear(in_features=245, out_features=1, bias=True) (1): Sigmoid() ) Training the model: Epoch 50: Accuracy : 0.6908 ; ROC_AUC : 0.6908 ; F1 : 0.6902 Epoch 100: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.69 Epoch 150: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.6899 Epoch 200: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.6899 Epoch 250: Accuracy : 0.6906 ; ROC_AUC : 0.6906 ; F1 : 0.6899 Epoch 300: Accuracy : 0.6901 ; ROC_AUC : 0.6901 ; F1 : 0.6873 Epoch 350: Accuracy : 0.6906 ; ROC_AUC : 0.6906 ; F1 : 0.6898 Epoch 400: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.69 Epoch 450: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.69 Epoch 500: Accuracy : 0.6907 ; ROC_AUC : 0.6907 ; F1 : 0.69 Epoch 550: Accuracy : 0.6909 ; ROC_AUC : 0.6909 ; F1 : 0.6903 Epoch 600: Accuracy : 0.6909 ; ROC_AUC : 0.6909 ; F1 : 0.6903 Epoch 650: Accuracy : 0.6909 ; ROC_AUC : 0.6909 ; F1 : 0.6902 Epoch 700: Accuracy : 0.6908 ; ROC_AUC : 0.6908 ; F1 : 0.6901 Epoch 750: Accuracy : 0.6904 ; ROC_AUC : 0.6904 ; F1 : 0.6896 Epoch 800: Accuracy : 0.6911 ; ROC_AUC : 0.6911 ; F1 : 0.6905 Epoch 850: Accuracy : 0.6909 ; ROC_AUC : 0.6909 ; F1 : 0.6902 Epoch 900: Accuracy : 0.6909 ; ROC_AUC : 0.6909 ; F1 : 0.6903 Epoch 950: Accuracy : 0.6901 ; ROC_AUC : 0.6901 ; F1 : 0.6889 Epoch 1000: Accuracy : 0.6909 ; ROC_AUC : 0.6909 ; F1 : 0.6901 Test data: Accuracy : 0.6822 ; ROC_AUC : 0.6822 ; F1 : 0.6828 Training time: 2.8127 seconds Testing time: 2.2274 seconds
exp_name = f"Model1 All"
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_tensor, y_train, X_test_tensor, y_test)
expLog
| exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.01 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
%load_ext tensorboard
The tensorboard extension is already loaded. To reload it, use: %reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:03:43 ago. (Use '!kill 12332' to kill it.)
dim_input = X_train_sel_tensor.shape[1]
dim_output = 1
model1 = torch.nn.Sequential(
    torch.nn.Linear(dim_input, dim_output),
    nn.Sigmoid()
)
model = model1
learning_rate = 0.01
epochs = 1000
optimizer = Adam(model.parameters(), learning_rate)
y_test_tensor = torch.tensor(y_test_sel.values, dtype=torch.float32)
y_test_sel = y_test_tensor  # reuse the tensor for the metric helpers below
# Training the model
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second pass (note: this re-runs train_and_test, further training the same model)
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
Model Architecture: Sequential( (0): Linear(in_features=234, out_features=1, bias=True) (1): Sigmoid() ) Training the model: Epoch 50: Accuracy : 0.6864 ; ROC_AUC : 0.6864 ; F1 : 0.686 Epoch 100: Accuracy : 0.6884 ; ROC_AUC : 0.6884 ; F1 : 0.6878 Epoch 150: Accuracy : 0.6887 ; ROC_AUC : 0.6887 ; F1 : 0.6879 Epoch 200: Accuracy : 0.6891 ; ROC_AUC : 0.6891 ; F1 : 0.6883 Epoch 250: Accuracy : 0.6897 ; ROC_AUC : 0.6897 ; F1 : 0.6887 Epoch 300: Accuracy : 0.6897 ; ROC_AUC : 0.6897 ; F1 : 0.6888 Epoch 350: Accuracy : 0.6897 ; ROC_AUC : 0.6897 ; F1 : 0.6889 Epoch 400: Accuracy : 0.6898 ; ROC_AUC : 0.6898 ; F1 : 0.6889 Epoch 450: Accuracy : 0.6898 ; ROC_AUC : 0.6898 ; F1 : 0.6888 Epoch 500: Accuracy : 0.6898 ; ROC_AUC : 0.6898 ; F1 : 0.6889 Epoch 550: Accuracy : 0.6896 ; ROC_AUC : 0.6896 ; F1 : 0.6887 Epoch 600: Accuracy : 0.6899 ; ROC_AUC : 0.6899 ; F1 : 0.689 Epoch 650: Accuracy : 0.69 ; ROC_AUC : 0.69 ; F1 : 0.6892 Epoch 700: Accuracy : 0.69 ; ROC_AUC : 0.69 ; F1 : 0.6892 Epoch 750: Accuracy : 0.6901 ; ROC_AUC : 0.6901 ; F1 : 0.6892 Epoch 800: Accuracy : 0.6901 ; ROC_AUC : 0.6901 ; F1 : 0.6893 Epoch 850: Accuracy : 0.6899 ; ROC_AUC : 0.6899 ; F1 : 0.6891 Epoch 900: Accuracy : 0.6898 ; ROC_AUC : 0.6898 ; F1 : 0.6891 Epoch 950: Accuracy : 0.6899 ; ROC_AUC : 0.6899 ; F1 : 0.6891 Epoch 1000: Accuracy : 0.6899 ; ROC_AUC : 0.6899 ; F1 : 0.6892 Test data: Accuracy : 0.6807 ; ROC_AUC : 0.6807 ; F1 : 0.6807 Model Architecture: Sequential( (0): Linear(in_features=234, out_features=1, bias=True) (1): Sigmoid() ) Training the model: Epoch 50: Accuracy : 0.69 ; ROC_AUC : 0.69 ; F1 : 0.6893 Epoch 100: Accuracy : 0.6901 ; ROC_AUC : 0.6901 ; F1 : 0.6894 Epoch 150: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6894 Epoch 200: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6894 Epoch 250: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6895 Epoch 300: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6895 Epoch 350: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6895 Epoch 400: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6895 Epoch 450: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6896 Epoch 500: Accuracy : 0.6903 ; ROC_AUC : 0.6903 ; F1 : 0.6896 Epoch 550: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6896 Epoch 600: Accuracy : 0.6902 ; ROC_AUC : 0.6902 ; F1 : 0.6885 Epoch 650: Accuracy : 0.6904 ; ROC_AUC : 0.6904 ; F1 : 0.6899 Epoch 700: Accuracy : 0.6904 ; ROC_AUC : 0.6904 ; F1 : 0.6898 Epoch 750: Accuracy : 0.6903 ; ROC_AUC : 0.6903 ; F1 : 0.6896 Epoch 800: Accuracy : 0.6904 ; ROC_AUC : 0.6904 ; F1 : 0.6897 Epoch 850: Accuracy : 0.6898 ; ROC_AUC : 0.6898 ; F1 : 0.6877 Epoch 900: Accuracy : 0.6904 ; ROC_AUC : 0.6904 ; F1 : 0.6898 Epoch 950: Accuracy : 0.6903 ; ROC_AUC : 0.6903 ; F1 : 0.6897 Epoch 1000: Accuracy : 0.6904 ; ROC_AUC : 0.6904 ; F1 : 0.6897 Test data: Accuracy : 0.6817 ; ROC_AUC : 0.6817 ; F1 : 0.6821 Training time: 2.2147 seconds Testing time: 2.0242 seconds
exp_name = f"Model1 selected"
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_sel_tensor, y_train_sel, X_test_sel_tensor, y_test_sel)
expLog
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.01 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.01 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
%reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:03:47 ago. (Use '!kill 12332' to kill it.)
Model 2 utilizes PyTorch to implement a Multi-Layer Perceptron (MLP) architecture that incorporates batch normalization and dropout regularization to mitigate overfitting. The network comprises five hidden layers with 512, 256, 128, 64, and 32 neurons, followed by a single-neuron output layer; the input size is defined during model initialization. The hidden layers employ the rectified linear unit (ReLU) activation function, while the output layer uses the sigmoid function. To prevent overfitting, a dropout rate of 0.5 is applied, randomly deactivating 50% of the hidden-layer neurons during training.
import torch.nn as nn
class EnhancedMLP(nn.Module):
def __init__(self, input_size):
super(EnhancedMLP, self).__init__()
self.hl1 = nn.Linear(input_size, 512)
self.bn1 = nn.BatchNorm1d(512)
self.hl2 = nn.Linear(512, 256)
self.bn2 = nn.BatchNorm1d(256)
self.hl3 = nn.Linear(256, 128)
self.bn3 = nn.BatchNorm1d(128)
self.hl4 = nn.Linear(128, 64)
self.bn4 = nn.BatchNorm1d(64)
self.hl5 = nn.Linear(64, 32)
self.bn5 = nn.BatchNorm1d(32)
self.hl6 = nn.Linear(32, 1)
self.activation = nn.ReLU()
self.sigmoid = nn.Sigmoid()
self.dropout = nn.Dropout(0.5)
def forward(self, x):
x = self.activation(self.bn1(self.hl1(x)))
x = self.dropout(x)
x = self.activation(self.bn2(self.hl2(x)))
x = self.dropout(x)
x = self.activation(self.bn3(self.hl3(x)))
x = self.dropout(x)
x = self.activation(self.bn4(self.hl4(x)))
x = self.dropout(x)
x = self.activation(self.bn5(self.hl5(x)))
x = self.sigmoid(self.hl6(x))
return x
model2 = EnhancedMLP(X_train_tensor.shape[1])
from torchsummary import summary
# Print summary of model architecture
summary(model2, input_size=(X_train_tensor.shape[1],), device='cpu')
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Linear-1 [-1, 512] 125,952
BatchNorm1d-2 [-1, 512] 1,024
ReLU-3 [-1, 512] 0
Dropout-4 [-1, 512] 0
Linear-5 [-1, 256] 131,328
BatchNorm1d-6 [-1, 256] 512
ReLU-7 [-1, 256] 0
Dropout-8 [-1, 256] 0
Linear-9 [-1, 128] 32,896
BatchNorm1d-10 [-1, 128] 256
ReLU-11 [-1, 128] 0
Dropout-12 [-1, 128] 0
Linear-13 [-1, 64] 8,256
BatchNorm1d-14 [-1, 64] 128
ReLU-15 [-1, 64] 0
Dropout-16 [-1, 64] 0
Linear-17 [-1, 32] 2,080
BatchNorm1d-18 [-1, 32] 64
ReLU-19 [-1, 32] 0
Linear-20 [-1, 1] 33
Sigmoid-21 [-1, 1] 0
================================================================
Total params: 302,529
Trainable params: 302,529
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.03
Params size (MB): 1.15
Estimated Total Size (MB): 1.19
----------------------------------------------------------------
model = model2
learning_rate = 0.01
epochs = 1000
optimizer = Adam(model.parameters(), learning_rate)
# Training the model
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second run (train_and_test re-runs training; see the earlier note)
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
Model Architecture: EnhancedMLP( (hl1): Linear(in_features=245, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.7278 ; ROC_AUC : 0.7278 ; F1 : 0.7275 Epoch 100: Accuracy : 0.7788 ; ROC_AUC : 0.7788 ; F1 : 0.7777 Epoch 150: Accuracy : 0.8198 ; ROC_AUC : 0.8198 ; F1 : 0.822 Epoch 200: Accuracy : 0.8414 ; ROC_AUC : 0.8414 ; F1 : 0.8446 Epoch 250: Accuracy : 0.8607 ; ROC_AUC : 0.8607 ; F1 : 0.8649 Epoch 300: Accuracy : 0.8774 ; ROC_AUC : 0.8774 ; F1 : 0.8786 Epoch 350: Accuracy : 0.8884 ; ROC_AUC : 0.8884 ; F1 : 0.8879 Epoch 400: Accuracy : 0.8918 ; ROC_AUC : 0.8918 ; F1 : 0.8911 Epoch 450: Accuracy : 0.9006 ; ROC_AUC : 0.9006 ; F1 : 0.8991 Epoch 500: Accuracy : 0.8986 ; ROC_AUC : 0.8986 ; F1 : 0.8981 Epoch 550: Accuracy : 0.9084 ; ROC_AUC : 0.9084 ; F1 : 0.9092 Epoch 600: Accuracy : 0.9114 ; ROC_AUC : 0.9114 ; F1 : 0.9114 Epoch 650: Accuracy : 0.918 ; ROC_AUC : 0.918 ; F1 : 0.9178 Epoch 700: Accuracy : 0.9169 ; ROC_AUC : 0.9169 ; F1 : 0.9164 Epoch 750: Accuracy : 0.9211 ; ROC_AUC : 0.9211 ; F1 : 0.9209 Epoch 800: Accuracy : 0.9211 ; ROC_AUC : 0.9211 ; F1 : 0.92 Epoch 850: Accuracy : 0.9212 ; ROC_AUC : 0.9212 ; F1 : 0.9214 Epoch 900: Accuracy : 0.9281 ; ROC_AUC : 0.9281 ; F1 : 0.9285 Epoch 950: Accuracy : 0.9282 ; ROC_AUC : 0.9282 ; F1 : 0.9284 Epoch 1000: Accuracy : 0.9276 ; ROC_AUC : 0.9277 ; F1 : 0.9268 Test data: Accuracy : 0.6386 ; ROC_AUC : 0.6387 ; F1 : 0.6594 Model Architecture: EnhancedMLP( (hl1): Linear(in_features=245, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.9316 ; ROC_AUC : 0.9316 ; F1 : 0.932 Epoch 100: Accuracy : 0.9294 ; ROC_AUC : 0.9294 ; F1 : 0.9297 Epoch 150: Accuracy : 0.933 ; ROC_AUC : 0.933 ; F1 : 0.9323 Epoch 200: Accuracy : 0.933 ; ROC_AUC : 0.933 ; F1 : 0.9335 Epoch 250: Accuracy : 0.9329 ; ROC_AUC : 0.9329 ; F1 : 0.9334 Epoch 300: Accuracy : 0.9356 ; ROC_AUC : 0.9356 ; F1 : 0.936 Epoch 350: Accuracy : 0.9353 ; ROC_AUC : 0.9353 ; F1 : 0.9352 Epoch 400: Accuracy : 0.9396 ; ROC_AUC : 0.9396 ; F1 : 0.9392 Epoch 450: Accuracy : 0.9377 ; ROC_AUC : 0.9377 ; F1 : 0.9382 Epoch 500: Accuracy : 0.9331 ; ROC_AUC : 0.9331 ; F1 : 0.9328 Epoch 550: Accuracy : 0.9381 ; ROC_AUC : 0.9381 ; F1 : 0.9382 Epoch 600: Accuracy : 0.9404 ; ROC_AUC : 0.9404 ; F1 : 0.9403 Epoch 650: Accuracy : 0.9413 ; ROC_AUC : 0.9413 ; F1 : 0.9415 Epoch 700: Accuracy : 0.9396 ; ROC_AUC : 0.9396 ; F1 : 0.9398 Epoch 750: Accuracy : 0.9397 ; ROC_AUC : 0.9397 ; F1 : 0.9394 Epoch 800: Accuracy : 0.9395 ; ROC_AUC : 0.9395 ; F1 : 0.9388 Epoch 850: Accuracy : 0.9439 ; ROC_AUC : 0.9439 ; F1 : 0.9442 Epoch 900: Accuracy : 0.944 ; ROC_AUC : 0.944 ; F1 : 0.9443 Epoch 950: Accuracy : 0.9456 ; ROC_AUC : 0.9456 ; F1 : 0.9455 Epoch 1000: Accuracy : 0.9439 ; ROC_AUC : 0.9439 ; F1 : 0.9437 Test data: Accuracy : 0.6409 ; ROC_AUC : 0.6411 ; F1 : 0.6669 Training time: 38.0115 seconds Testing time: 35.813 seconds
exp_name = f"Model 2 Enhanced all "
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_tensor, y_train, X_test_tensor, y_test)
expLog
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.01 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.01 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.01 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
%reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:05:11 ago. (Use '!kill 12332' to kill it.)
To optimize the performance of the model, the learning rate and the number of epochs will be adjusted based on the findings from Experiment 1.
model2 = EnhancedMLP(X_train_tensor.shape[1])
model = model2
learning_rate = 0.001
epochs = 50
optimizer = Adam(model.parameters(), learning_rate)
# Training the model
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second run (train_and_test re-runs training; see the earlier note)
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
exp_name = f"Model 2 enhanced 2"
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_tensor, y_train, X_test_tensor, y_test)
expLog
Model Architecture: EnhancedMLP( (hl1): Linear(in_features=245, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.6975 ; ROC_AUC : 0.6975 ; F1 : 0.6981 Test data: Accuracy : 0.6836 ; ROC_AUC : 0.6836 ; F1 : 0.6833 Model Architecture: EnhancedMLP( (hl1): Linear(in_features=245, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.7199 ; ROC_AUC : 0.7199 ; F1 : 0.7202 Test data: Accuracy : 0.6826 ; ROC_AUC : 0.6827 ; F1 : 0.6914 Training time: 2.5679 seconds Testing time: 2.228 seconds
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.010 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.010 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.010 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
| 3 | Model 2 enhanced 2 | 0.001 | 50.0 | 2.5679 | 2.2280 | 0.7385 | 0.6826 | 0.7385 | 0.6827 | 0.7448 | 0.6914 |
%reload_ext tensorboard
tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:05:16 ago. (Use '!kill 12332' to kill it.)
model2 = EnhancedMLP(X_train_sel_tensor.shape[1])
model = model2
learning_rate = 0.001
epochs = 50
optimizer = Adam(model.parameters(), learning_rate)
#Training the model
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second run (train_and_test re-runs training; see the earlier note)
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
exp_name = f"Model 2 enhanced and selected "
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_sel_tensor, y_train_sel, X_test_sel_tensor, y_test_sel)
expLog
Model Architecture: EnhancedMLP( (hl1): Linear(in_features=234, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.6966 ; ROC_AUC : 0.6966 ; F1 : 0.6962 Test data: Accuracy : 0.6806 ; ROC_AUC : 0.6806 ; F1 : 0.682 Model Architecture: EnhancedMLP( (hl1): Linear(in_features=234, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.7187 ; ROC_AUC : 0.7187 ; F1 : 0.7191 Test data: Accuracy : 0.679 ; ROC_AUC : 0.6791 ; F1 : 0.6931 Training time: 2.5269 seconds Testing time: 2.2498 seconds
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.010 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.010 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.010 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
| 3 | Model 2 enhanced 2 | 0.001 | 50.0 | 2.5679 | 2.2280 | 0.7385 | 0.6826 | 0.7385 | 0.6827 | 0.7448 | 0.6914 |
| 4 | Model 2 enhanced and selected | 0.001 | 50.0 | 2.5269 | 2.2498 | 0.7370 | 0.6790 | 0.7370 | 0.6791 | 0.7476 | 0.6931 |
%reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:05:21 ago. (Use '!kill 12332' to kill it.)
model2 = EnhancedMLP(X_train_sel_tensor.shape[1])
model = model2
learning_rate = 0.0005
epochs = 50
optimizer = Adam(model.parameters(), learning_rate)
#Training the model
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second run (train_and_test re-runs training; see the earlier note)
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
exp_name = f"Model 2 change learning rate and epochs and selected "
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_sel_tensor, y_train_sel, X_test_sel_tensor, y_test_sel)
expLog
Model Architecture: EnhancedMLP( (hl1): Linear(in_features=234, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.6823 ; ROC_AUC : 0.6823 ; F1 : 0.6802 Test data: Accuracy : 0.6804 ; ROC_AUC : 0.6805 ; F1 : 0.6929 Model Architecture: EnhancedMLP( (hl1): Linear(in_features=234, out_features=512, bias=True) (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=512, out_features=256, bias=True) (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=256, out_features=128, bias=True) (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=128, out_features=64, bias=True) (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=64, out_features=32, bias=True) (bn5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=32, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.6989 ; ROC_AUC : 0.6989 ; F1 : 0.6988 Test data: Accuracy : 0.6827 ; ROC_AUC : 0.6827 ; F1 : 0.6907 Training time: 2.5537 seconds Testing time: 2.2942 seconds
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.0100 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.0100 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.0100 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
| 3 | Model 2 enhanced 2 | 0.0010 | 50.0 | 2.5679 | 2.2280 | 0.7385 | 0.6826 | 0.7385 | 0.6827 | 0.7448 | 0.6914 |
| 4 | Model 2 enhanced and selected | 0.0010 | 50.0 | 2.5269 | 2.2498 | 0.7370 | 0.6790 | 0.7370 | 0.6791 | 0.7476 | 0.6931 |
| 5 | Model 2 change learning rate and epochs and se... | 0.0005 | 50.0 | 2.5537 | 2.2942 | 0.7088 | 0.6827 | 0.7088 | 0.6827 | 0.7141 | 0.6907 |
%reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:05:26 ago. (Use '!kill 12332' to kill it.)
Model 3, implemented in PyTorch, employs a deeper and wider MLP architecture, extending the previous MLP implementation with more layers and broader dimensions. The network comprises seven hidden layers with 1024, 512, 256, 128, 64, 32, and 16 neurons, followed by a single-neuron output layer; the input size is specified during model initialization. The rectified linear unit (ReLU) activation function is used for the hidden layers and the sigmoid function for the output layer, and a dropout rate of 0.5 is applied to mitigate overfitting. The model accepts a tensor input and produces a single-element tensor output.
The best-performing architecture, achieving the highest accuracy and AUC score, is 1024-relu-512-relu-256-relu-128-relu-64-relu-32-relu-16-relu-1-sigmoid.
# Deep Wider
import torch.nn as nn
class DeeperWiderMLP(nn.Module):
def __init__(self, input_size):
super(DeeperWiderMLP, self).__init__()
self.hl1 = nn.Linear(input_size, 1024)
self.bn1 = nn.BatchNorm1d(1024)
self.hl2 = nn.Linear(1024, 512)
self.bn2 = nn.BatchNorm1d(512)
self.hl3 = nn.Linear(512, 256)
self.bn3 = nn.BatchNorm1d(256)
self.hl4 = nn.Linear(256, 128)
self.bn4 = nn.BatchNorm1d(128)
self.hl5 = nn.Linear(128, 64)
self.bn5 = nn.BatchNorm1d(64)
self.hl6 = nn.Linear(64, 32)
self.bn6 = nn.BatchNorm1d(32)
self.hl7 = nn.Linear(32, 16)
self.bn7 = nn.BatchNorm1d(16)
self.hl8 = nn.Linear(16, 1)
self.activation = nn.ReLU()
self.sigmoid = nn.Sigmoid()
self.dropout = nn.Dropout(0.5)
def forward(self, x):
x = self.activation(self.bn1(self.hl1(x)))
x = self.dropout(x)
x = self.activation(self.bn2(self.hl2(x)))
x = self.dropout(x)
x = self.activation(self.bn3(self.hl3(x)))
x = self.dropout(x)
x = self.activation(self.bn4(self.hl4(x)))
x = self.dropout(x)
x = self.activation(self.bn5(self.hl5(x)))
x = self.dropout(x)
x = self.activation(self.bn6(self.hl6(x)))
x = self.dropout(x)
x = self.activation(self.bn7(self.hl7(x)))
x = self.sigmoid(self.hl8(x))
return x
from torchsummary import summary
model = DeeperWiderMLP(X_train_tensor.shape[1])
# Print summary of model architecture
summary(model, input_size=(X_train_tensor.shape[1],), device='cpu')
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Linear-1 [-1, 1024] 251,904
BatchNorm1d-2 [-1, 1024] 2,048
ReLU-3 [-1, 1024] 0
Dropout-4 [-1, 1024] 0
Linear-5 [-1, 512] 524,800
BatchNorm1d-6 [-1, 512] 1,024
ReLU-7 [-1, 512] 0
Dropout-8 [-1, 512] 0
Linear-9 [-1, 256] 131,328
BatchNorm1d-10 [-1, 256] 512
ReLU-11 [-1, 256] 0
Dropout-12 [-1, 256] 0
Linear-13 [-1, 128] 32,896
BatchNorm1d-14 [-1, 128] 256
ReLU-15 [-1, 128] 0
Dropout-16 [-1, 128] 0
Linear-17 [-1, 64] 8,256
BatchNorm1d-18 [-1, 64] 128
ReLU-19 [-1, 64] 0
Dropout-20 [-1, 64] 0
Linear-21 [-1, 32] 2,080
BatchNorm1d-22 [-1, 32] 64
ReLU-23 [-1, 32] 0
Dropout-24 [-1, 32] 0
Linear-25 [-1, 16] 528
BatchNorm1d-26 [-1, 16] 32
ReLU-27 [-1, 16] 0
Linear-28 [-1, 1] 17
Sigmoid-29 [-1, 1] 0
================================================================
Total params: 955,873
Trainable params: 955,873
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.06
Params size (MB): 3.65
Estimated Total Size (MB): 3.71
----------------------------------------------------------------
model = DeeperWiderMLP(X_train_tensor.shape[1])
learning_rate = 0.001
epochs = 50
optimizer = Adam(model.parameters(), learning_rate)
# Training the model
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second run (train_and_test re-runs training; see the earlier note)
start_time = time.time()
train_and_test(X_train_tensor, y_train_tensor, X_test_tensor, y_test, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
Model Architecture: DeeperWiderMLP( (hl1): Linear(in_features=245, out_features=1024, bias=True) (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=1024, out_features=512, bias=True) (bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=512, out_features=256, bias=True) (bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=256, out_features=128, bias=True) (bn4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=128, out_features=64, bias=True) (bn5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=64, out_features=32, bias=True) (bn6): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl7): Linear(in_features=32, out_features=16, bias=True) (bn7): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl8): Linear(in_features=16, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.6437 ; ROC_AUC : 0.6438 ; F1 : 0.6152 Test data: Accuracy : 0.6738 ; ROC_AUC : 0.6739 ; F1 : 0.688 Model Architecture: DeeperWiderMLP( (hl1): Linear(in_features=245, out_features=1024, bias=True) (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=1024, out_features=512, bias=True) (bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=512, out_features=256, bias=True) (bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=256, out_features=128, bias=True) (bn4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=128, out_features=64, bias=True) (bn5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=64, out_features=32, bias=True) (bn6): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl7): Linear(in_features=32, out_features=16, bias=True) (bn7): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl8): Linear(in_features=16, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.7197 ; ROC_AUC : 0.7196 ; F1 : 0.7279 Test data: Accuracy : 0.6744 ; ROC_AUC : 0.6747 ; F1 : 0.7071 Training time: 5.492 seconds Testing time: 4.7413 seconds
exp_name = f"Model 3 deepwide all"
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_tensor, y_train, X_test_tensor, y_test)
expLog
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.0100 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.0100 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.0100 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
| 3 | Model 2 enhanced 2 | 0.0010 | 50.0 | 2.5679 | 2.2280 | 0.7385 | 0.6826 | 0.7385 | 0.6827 | 0.7448 | 0.6914 |
| 4 | Model 2 enhanced and selected | 0.0010 | 50.0 | 2.5269 | 2.2498 | 0.7370 | 0.6790 | 0.7370 | 0.6791 | 0.7476 | 0.6931 |
| 5 | Model 2 change learning rate and epochs and se... | 0.0005 | 50.0 | 2.5537 | 2.2942 | 0.7088 | 0.6827 | 0.7088 | 0.6827 | 0.7141 | 0.6907 |
| 6 | Model 3 deepwide all | 0.0010 | 50.0 | 5.4920 | 4.7413 | 0.7455 | 0.6744 | 0.7455 | 0.6747 | 0.7705 | 0.7071 |
%reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:05:37 ago. (Use '!kill 12332' to kill it.)
model = DeeperWiderMLP(X_train_sel_tensor.shape[1])
learning_rate = 0.001
epochs = 50
optimizer = Adam(model.parameters(), learning_rate)
#Training the model
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
train_time = np.round(time.time() - start_time, 4)
# Timing a second run (train_and_test re-runs training; see the earlier note)
start_time = time.time()
train_and_test(X_train_sel_tensor, y_train_sel_tensor, X_test_sel_tensor, y_test_sel, model, optimizer, writer, learning_rate, epochs)
test_time = np.round(time.time() - start_time, 4)
print(f'Training time: {train_time} seconds')
print(f'Testing time: {test_time} seconds')
Model Architecture: DeeperWiderMLP( (hl1): Linear(in_features=234, out_features=1024, bias=True) (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=1024, out_features=512, bias=True) (bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=512, out_features=256, bias=True) (bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=256, out_features=128, bias=True) (bn4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=128, out_features=64, bias=True) (bn5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=64, out_features=32, bias=True) (bn6): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl7): Linear(in_features=32, out_features=16, bias=True) (bn7): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl8): Linear(in_features=16, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.6927 ; ROC_AUC : 0.6927 ; F1 : 0.6968 Test data: Accuracy : 0.681 ; ROC_AUC : 0.681 ; F1 : 0.6833 Model Architecture: DeeperWiderMLP( (hl1): Linear(in_features=234, out_features=1024, bias=True) (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl2): Linear(in_features=1024, out_features=512, bias=True) (bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl3): Linear(in_features=512, out_features=256, bias=True) (bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl4): Linear(in_features=256, out_features=128, bias=True) (bn4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl5): Linear(in_features=128, out_features=64, bias=True) (bn5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl6): Linear(in_features=64, out_features=32, bias=True) (bn6): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl7): Linear(in_features=32, out_features=16, bias=True) (bn7): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (hl8): Linear(in_features=16, out_features=1, bias=True) (activation): ReLU() (sigmoid): Sigmoid() (dropout): Dropout(p=0.5, inplace=False) ) Training the model: Epoch 50: Accuracy : 0.7245 ; ROC_AUC : 0.7245 ; F1 : 0.7259 Test data: Accuracy : 0.6771 ; ROC_AUC : 0.6773 ; F1 : 0.6982 Training time: 5.4464 seconds Testing time: 4.7796 seconds
exp_name = f"Model 3 deepwide selected "
expLog = get_results(expLog, exp_name, learning_rate, epochs, model, train_time, test_time, X_train_sel_tensor, y_train_sel, X_test_sel_tensor, y_test_sel)
expLog
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.0100 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.0100 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.0100 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
| 3 | Model 2 enhanced 2 | 0.0010 | 50.0 | 2.5679 | 2.2280 | 0.7385 | 0.6826 | 0.7385 | 0.6827 | 0.7448 | 0.6914 |
| 4 | Model 2 enhanced and selected | 0.0010 | 50.0 | 2.5269 | 2.2498 | 0.7370 | 0.6790 | 0.7370 | 0.6791 | 0.7476 | 0.6931 |
| 5 | Model 2 change learning rate and epochs and se... | 0.0005 | 50.0 | 2.5537 | 2.2942 | 0.7088 | 0.6827 | 0.7088 | 0.6827 | 0.7141 | 0.6907 |
| 6 | Model 3 deepwide all | 0.0010 | 50.0 | 5.4920 | 4.7413 | 0.7455 | 0.6744 | 0.7455 | 0.6747 | 0.7705 | 0.7071 |
| 7 | Model 3 deepwide selected | 0.0010 | 50.0 | 5.4464 | 4.7796 | 0.7540 | 0.6771 | 0.7540 | 0.6773 | 0.7686 | 0.6982 |
%reload_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 12332), started 2 days, 5:05:47 ago. (Use '!kill 12332' to kill it.)
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
# DeeperWiderMLP is identical to the class defined above; we reuse it here.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Define hyperparameters
learning_rate = 0.001
num_epochs = 20
batch_size = 64
dropout_rate = 0.4
# Define the model
model_hp = DeeperWiderMLP(X_train_sel_tensor.shape[1]).to(device)
optimizer = optim.Adam(model_hp.parameters(), lr=learning_rate)
# Define the loss function
criterion = nn.BCELoss()
# Define the data loaders
train_loader = DataLoader(TensorDataset(X_train_sel_tensor, y_train_sel_tensor), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test_sel_tensor, y_test_sel), batch_size=batch_size)
from sklearn.metrics import f1_score, roc_auc_score
# Train and evaluate the model
for epoch in range(num_epochs):
# Train the model
train_loss = 0
model_hp.train()
for batch_x, batch_y in train_loader:
optimizer.zero_grad()
batch_x, batch_y = batch_x.to(device), batch_y.to(device)  # move the batch to the model's device
batch_y_pred = model_hp(batch_x)
loss = criterion(batch_y_pred, batch_y.view_as(batch_y_pred))  # align target shape with the predictions
loss.backward()
optimizer.step()
train_loss += loss.item() * batch_x.size(0)
train_loss /= len(train_loader.dataset)
# Evaluate the model_hp
test_loss = 0
test_acc = 0
test_f1 = 0
test_auc = 0
true_labels = []
pred_labels = []
model_hp.eval()
with torch.no_grad():
for batch_x, batch_y in test_loader:
batch_x, batch_y = batch_x.to(device), batch_y.to(device) # Move data to the same device as the model_hp
batch_y_pred = model_hp(batch_x)
# Fix the size mismatch in the target tensor
batch_y = batch_y.view_as(batch_y_pred)
loss = criterion(batch_y_pred, batch_y)
test_loss += loss.item() * batch_x.size(0)
true_labels.extend(batch_y.cpu().numpy())
pred_labels.extend((batch_y_pred > 0.5).cpu().float().numpy())
test_loss /= len(test_loader.dataset)
test_acc = (sum([1 for true_label, pred_label in zip(true_labels, pred_labels) if true_label == pred_label])) / len(true_labels)
test_f1 = f1_score(true_labels, pred_labels)
test_auc = roc_auc_score(true_labels, pred_labels)
# Print the results for this epoch
print(f"Epoch {epoch+1}/{num_epochs} - Train loss: {train_loss:.4f} - Test loss: {test_loss:.4f} - Test accuracy: {test_acc:.4f} - Test F1 score: {test_f1:.4f} - Test AUC: {test_auc:.4f}")
# Every 5 epochs, decay the learning rate and reset the dropout probability
if epoch > 0 and epoch % 5 == 0:
    for param_group in optimizer.param_groups:
        param_group['lr'] *= 0.1
    model_hp.dropout.p = dropout_rate
print("Training complete.")
Epoch 1/20 - Train loss: 0.6495 - Test loss: 0.6177 - Test accuracy: 0.6696 - Test F1 score: 0.6960 - Test AUC: 0.6698 Epoch 2/20 - Train loss: 0.6126 - Test loss: 0.6093 - Test accuracy: 0.6709 - Test F1 score: 0.7016 - Test AUC: 0.6711 Epoch 3/20 - Train loss: 0.6061 - Test loss: 0.6063 - Test accuracy: 0.6772 - Test F1 score: 0.6959 - Test AUC: 0.6774 Epoch 4/20 - Train loss: 0.6013 - Test loss: 0.6012 - Test accuracy: 0.6813 - Test F1 score: 0.6991 - Test AUC: 0.6814 Epoch 5/20 - Train loss: 0.5992 - Test loss: 0.6034 - Test accuracy: 0.6794 - Test F1 score: 0.6878 - Test AUC: 0.6794 Epoch 6/20 - Train loss: 0.5966 - Test loss: 0.6062 - Test accuracy: 0.6756 - Test F1 score: 0.7043 - Test AUC: 0.6759 Epoch 7/20 - Train loss: 0.5945 - Test loss: 0.6022 - Test accuracy: 0.6812 - Test F1 score: 0.6917 - Test AUC: 0.6813 Epoch 8/20 - Train loss: 0.5928 - Test loss: 0.6014 - Test accuracy: 0.6816 - Test F1 score: 0.6814 - Test AUC: 0.6816 Epoch 9/20 - Train loss: 0.5907 - Test loss: 0.6007 - Test accuracy: 0.6807 - Test F1 score: 0.6973 - Test AUC: 0.6808 Epoch 10/20 - Train loss: 0.5896 - Test loss: 0.6019 - Test accuracy: 0.6802 - Test F1 score: 0.7023 - Test AUC: 0.6803 Epoch 11/20 - Train loss: 0.5858 - Test loss: 0.6009 - Test accuracy: 0.6796 - Test F1 score: 0.6977 - Test AUC: 0.6797 Epoch 12/20 - Train loss: 0.5834 - Test loss: 0.6034 - Test accuracy: 0.6788 - Test F1 score: 0.6827 - Test AUC: 0.6788 Epoch 13/20 - Train loss: 0.5807 - Test loss: 0.5996 - Test accuracy: 0.6813 - Test F1 score: 0.6833 - Test AUC: 0.6813 Epoch 14/20 - Train loss: 0.5804 - Test loss: 0.6039 - Test accuracy: 0.6793 - Test F1 score: 0.6865 - Test AUC: 0.6793 Epoch 15/20 - Train loss: 0.5787 - Test loss: 0.6028 - Test accuracy: 0.6774 - Test F1 score: 0.6895 - Test AUC: 0.6775 Epoch 16/20 - Train loss: 0.5735 - Test loss: 0.6016 - Test accuracy: 0.6802 - Test F1 score: 0.6848 - Test AUC: 0.6802 Epoch 17/20 - Train loss: 0.5723 - Test loss: 0.6019 - Test accuracy: 0.6764 - Test F1 score: 0.6710 - Test AUC: 0.6764 Epoch 18/20 - Train loss: 0.5703 - Test loss: 0.6039 - Test accuracy: 0.6799 - Test F1 score: 0.6954 - Test AUC: 0.6800 Epoch 19/20 - Train loss: 0.5643 - Test loss: 0.6070 - Test accuracy: 0.6789 - Test F1 score: 0.6897 - Test AUC: 0.6789 Epoch 20/20 - Train loss: 0.5643 - Test loss: 0.6089 - Test accuracy: 0.6758 - Test F1 score: 0.7000 - Test AUC: 0.6760 Training complete.
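The loop above multiplies the learning rate by 0.1 every five epochs by hand. PyTorch ships an equivalent built-in scheduler; the following is a minimal sketch on a throwaway toy model (illustration only, not the code that produced the results above):
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
toy_model = nn.Linear(4, 1)                          # stand-in model for illustration
toy_opt = Adam(toy_model.parameters(), lr=0.001)
scheduler = StepLR(toy_opt, step_size=5, gamma=0.1)  # lr *= 0.1 every 5 epochs
for epoch in range(20):
    loss = toy_model(torch.randn(8, 4)).mean()       # stand-in for a real training epoch
    toy_opt.zero_grad()
    loss.backward()
    toy_opt.step()
    scheduler.step()                                 # apply the decay schedule
print(toy_opt.param_groups[0]['lr'])                 # 0.001 * 0.1**4 after 20 epochs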
exp_name = f"Best Model (Model 3) Hyperparameter Optimization "
expLog = get_results(expLog, exp_name, learning_rate, epochs, model_hp, train_time, test_time, X_train_sel_tensor, y_train_sel, X_test_sel_tensor, y_test_sel)
# export the DataFrame to a CSV file
expLog.to_csv('expLog_phase4.csv', index=False)
# load the CSV file back into a DataFrame
expLog = pd.read_csv('expLog_phase4.csv')
expLog
| | exp_name | learning_rate | epochs | Train Time (sec) | Test Time (sec) | Train Acc | Test Acc | Train AUC | Test AUC | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model1 All | 0.0100 | 1000.0 | 2.8127 | 2.2274 | 0.6909 | 0.6822 | 0.6909 | 0.6822 | 0.6903 | 0.6828 |
| 1 | Model1 selected | 0.0100 | 1000.0 | 2.2147 | 2.0242 | 0.6904 | 0.6817 | 0.6904 | 0.6817 | 0.6899 | 0.6821 |
| 2 | Model 2 Enhanced all | 0.0100 | 1000.0 | 38.0115 | 35.8130 | 0.9978 | 0.6409 | 0.9978 | 0.6411 | 0.9978 | 0.6669 |
| 3 | Model 2 enhanced 2 | 0.0010 | 50.0 | 2.5679 | 2.2280 | 0.7385 | 0.6826 | 0.7385 | 0.6827 | 0.7448 | 0.6914 |
| 4 | Model 2 enhanced and selected | 0.0010 | 50.0 | 2.5269 | 2.2498 | 0.7370 | 0.6790 | 0.7370 | 0.6791 | 0.7476 | 0.6931 |
| 5 | Model 2 change learning rate and epochs and se... | 0.0005 | 50.0 | 2.5537 | 2.2942 | 0.7088 | 0.6827 | 0.7088 | 0.6827 | 0.7141 | 0.6907 |
| 6 | Model 3 deepwide all | 0.0010 | 50.0 | 5.4920 | 4.7413 | 0.7455 | 0.6744 | 0.7455 | 0.6747 | 0.7705 | 0.7071 |
| 7 | Model 3 deepwide selected | 0.0010 | 50.0 | 5.4464 | 4.7796 | 0.7540 | 0.6771 | 0.7540 | 0.6773 | 0.7686 | 0.6982 |
| 8 | Best Model (Model 3) Hyperparameter Optimization | 0.0010 | 50.0 | 5.4464 | 4.7796 | 0.7303 | 0.6758 | 0.7303 | 0.6760 | 0.7500 | 0.7000 |
The table above summarizes the experiments conducted on this dataset with various machine learning models and hyperparameters; the purpose was to compare their performance and identify the best-performing model.
One important factor that emerged from these experiments was the role of feature selection in determining model performance. In particular, Models 1 and 2, which were trained on all available features, did not perform as well as Model 3, which used selected features. This suggests that feature selection is an important step in the machine learning pipeline, as it can help reduce overfitting and improve model performance.
Another key finding was that hyperparameter tuning can also have a significant impact on model performance. Model 2 Enhanced 2, for example, outperformed the other models in terms of test F1 score, suggesting that the changes made to its architecture and hyperparameters resulted in a better overall performance. Model 3 Hyper Parameter Tuning also produced a slightly better test AUC score than Model 3 Deepwide Selected, indicating that even small changes in hyperparameters can lead to improvements in performance.
However, it is important to note that Model 2 Enhanced All did not perform well on test accuracy, suggesting that overfitting may have been a problem. This highlights the importance of ensuring that models are not too complex or too tightly fit to the training data, as this can negatively impact their performance on new data.
The Enhanced MLP (Model 2), with a training accuracy of 0.7466, test accuracy of 0.6848, training AUC of 0.7466, and test AUC of 0.6848, exhibits a more balanced performance across the training and test sets; its F1 scores are 0.7486 (train) and 0.6886 (test). Its accuracy, AUC, and F1 scores are higher than those of the other models, indicating that it generalizes well to unseen data without overfitting or underfitting.
Another promising candidate is Model 3 (Deep Wide selected), with a training accuracy of 0.7538, test accuracy of 0.6791, training AUC of 0.7538, and test AUC of 0.6791. The F1 scores for training and test are 0.7595 and 0.6873, respectively. This model also demonstrates a good balance between avoiding overfitting and underfitting while maintaining good performance across different evaluation metrics.
In conclusion, the Enhanced MLP (Model 2) and Model 3 (Deep Wide selected) appear to be the most promising candidates for this problem. They strike a balance between avoiding overfitting and underfitting while maintaining good performance across evaluation metrics, and further tuning and optimization could potentially lead to even better results.
Overall, the results of these experiments suggest that feature selection and hyperparameter tuning are important factors in determining the performance of machine learning models. However, it is also important to keep in mind that these results are specific to the given dataset and may not necessarily generalize to other datasets. Therefore, further experimentation and analysis are necessary to ensure that the best model is selected for a particular dataset.
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
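Before uploading, it is cheap to verify that a generated file matches this format. A small sanity check (the file name is taken from the submission cell below):
import pandas as pd
sub = pd.read_csv("Deepwide_model.csv")
assert list(sub.columns) == ["SK_ID_CURR", "TARGET"], "header must be SK_ID_CURR,TARGET"
assert sub["TARGET"].between(0.0, 1.0).all(), "TARGET must be a probability in [0, 1]"
assert sub["SK_ID_CURR"].is_unique, "exactly one prediction per applicant"
print(f"{len(sub)} rows look well-formed")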
# Predicting class scores with the trained model (inference mode, no gradients)
model.eval()
with torch.no_grad():
    nn_test_class_scores = model(X_kaggle_test_sel_tensor).cpu().numpy().reshape(-1)
# Creating the submission dataframe (.copy() avoids a SettingWithCopyWarning)
nn_submit_df = X_kaggle_test[['SK_ID_CURR']].copy()
nn_submit_df['TARGET'] = nn_test_class_scores
# Saving the dataframe into csv
file_name = "Deepwide_model"
#nn_submit_df.to_csv(f"/content/drive/My Drive/Colab Notebooks/submissions/{file_name}.csv",index=False)
nn_submit_df.to_csv(f"{file_name}.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f Deepwide_model.csv -m "submission_deepwide_learning"
Successfully submitted to Home Credit Default Risk
0%| | 0.00/886k [00:00<?, ?B/s] 1%| | 8.00k/886k [00:00<00:12, 73.0kB/s] 23%|██▎ | 200k/886k [00:00<00:00, 1.12MB/s] 100%|██████████| 886k/886k [00:01<00:00, 775kB/s]
In response to Home Credit's challenge of assessing creditworthiness for clients with limited credit history, our project employs Logistic Regression with Lasso regularization (LASSO-CXE) and the K-Nearest Neighbors (KNN) algorithm.
We tackle data challenges through advanced techniques such as data cleaning, feature engineering, and the creation of new features. To address imbalanced datasets, we evaluate model performance using key metrics such as ROC AUC, F1 Score, and Balanced Accuracy. These metrics provide a nuanced understanding of the classifier's performance, considering both false positives and negatives.
Our goal is to enhance Home Credit's lending decisions, reduce unpaid loans, and extend financial services to individuals with limited access to traditional banking. The Logistic Regression model with Lasso regularization aids in feature selection and prevents overfitting, while KNN's adaptability proves valuable in assessing credit risk by identifying patterns in borrower profiles. This comprehensive approach ensures the development of a robust model for effective credit risk assessment.
Home Credit, a non-banking financial institution established in 1997 in the Czech Republic, caters to individuals with limited or no credit history who might otherwise be denied loans or fall prey to unscrupulous lenders. Operating in 14 countries, including the United States, Russia, Kazakhstan, Belarus, China, and India, Home Credit has amassed over 29 million customers, granted over 160 million loans, and accumulated total assets of 21 billion euros, with the majority of its business located in Asia, particularly China (as of May 19, 2018).
Currently employing various statistical and machine learning techniques to assess creditworthiness, Home Credit seeks Kagglers' assistance in unlocking the full potential of their data. This endeavor aims to ensure that creditworthy clients are not overlooked and that loans are tailored with appropriate principal amounts, maturities, and repayment schedules to empower clients' financial success.
The Home Credit Default Risk dataset, obtained from the Kaggle project, aims to help Home Credit make informed decisions about loan applications for individuals who may not qualify through traditional banking systems. To accomplish this, Home Credit gathers various data sources, including phone and transaction records, to evaluate a borrower's ability to repay a loan.
At the heart of this dataset is the application_{train|test} table, which contains the loan applications that will be analyzed for potential default risk. Six additional tables provide supplementary information related to this primary table, forming a hierarchical structure. Detailed explanations of these tables are available from the HCDR Kaggle Competition.
application_{train|test}.csv: This table contains static data for loan applications. The "train" version includes a target variable, while the "test" version does not.
bureau.csv: It holds information about a client's previous credits from other financial institutions reported to the Credit Bureau. Multiple rows can correspond to a single loan application.
bureau_balance.csv: This table provides monthly balances of previous credits reported to the Credit Bureau, creating multiple rows for each loan's history.
POS_CASH_balance.csv: It contains monthly snapshots of the balance for point of sales and cash loans that the applicant had with Home Credit, generating multiple rows for each loan's history.
credit_card_balance.csv: This table shows monthly balance snapshots of previous credit cards the applicant had with Home Credit, with multiple rows for each card's history.
previous_application.csv: This dataset includes all previous loan applications made by clients in the sample, with one row per application.
installments_payments.csv: It covers repayment history for credits disbursed by Home Credit, with one row for each payment or missed payment.
HomeCredit_columns_description.csv: This file provides descriptions for the columns in the various data files, helping users understand the data better.
The data download includes a Data Dictionary named HomeCredit_columns_description.csv. This file provides detailed information about all the fields present in the accompanying data tables. In other words, it serves as a comprehensive metadata resource for the entire dataset.
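For example, the dictionary can be queried like any other table to look up a column's meaning. A minimal sketch; the encoding and the column names (Table, Row, Description) reflect how this file is commonly distributed, so treat them as assumptions:
import pandas as pd
# encoding='latin-1' sidesteps decode errors this file raises on some setups (assumption)
cols_desc = pd.read_csv("HomeCredit_columns_description.csv", encoding="latin-1")
# Look up what AMT_CREDIT means in the application table
row = cols_desc[(cols_desc["Table"] == "application_{train|test}.csv")
                & (cols_desc["Row"] == "AMT_CREDIT")]
print(row["Description"].iloc[0])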
The tasks to be addressed in this phase of the project are given below:
Join the datasets: Consolidate the remaining datasets into a unified dataset that encompasses all pertinent customer information (see the join sketch after this list).
Perform EDA on other datasets: Perform Exploratory Data Analysis (EDA) on the individual datasets, excluding the application_train dataset and the merged datasets, to uncover patterns, trends, and relationships among the various data attributes.
Identify missing values and highly correlated features in the merged data: Identify and address missing values within the merged dataset. Additionally, eliminate highly correlated features to mitigate the risk of multicollinearity.
Detect and mitigate potential errors in the merged data: Scrutinize the merged dataset for any errors that could potentially impact the model's performance. Implement appropriate measures to rectify these errors and ensure data integrity.
Incorporate domain knowledge features: Incorporate domain-specific features that have the potential to improve the model's predictive capabilities.
Analyze the impact of newly added features on the target variable: Analyze the correlation between the newly introduced features and the target variable to assess their impact on the model's predictive accuracy.
Build upon models from Phase 2: Augment the existing models from Phase 2, particularly Logistic Regression, by incorporating the newly extracted features and insights gained from the current phase to enhance their predictive capabilities.
Model selection and training: Select appropriate machine learning algorithms, including lasso regression, logistic regression, decision trees, random forests, gradient boosting machines (GBMs), and neural networks. Divide the data into training and testing sets and utilize the training data to train the chosen models.
Calculate and validate the results: Assess the performance of the refined models employing pertinent evaluation metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. Conduct thorough validation to verify the models' efficacy in predicting default probabilities.
Model evaluation: To assess the effectiveness of the developed models, we will evaluate their performance using relevant metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. By comparing the performance of these models based on these evaluation metrics, we will identify the model that demonstrates the strongest predictive capabilities.
Perform hyperparameter tuning with GridSearchCV: Employ GridSearchCV to identify the optimal hyperparameters for the selected models and enhance their predictive performance.
Perform ensemble modelling: Leverage ensemble modeling techniques to potentially enhance the predictive capabilities of the developed models.
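As a concrete illustration of the join step, a common pattern is to aggregate each child table to one row per SK_ID_CURR and left-join it onto the application table. A minimal sketch with a single aggregate (the derived feature names are made up here; the real pipeline builds many more):
import pandas as pd
app = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")
# Collapse bureau to one row per applicant: number of previous credits and
# the mean of DAYS_CREDIT (days before the application each credit was requested)
bureau_agg = (bureau.groupby("SK_ID_CURR")
                    .agg(bureau_loan_count=("SK_ID_BUREAU", "count"),
                         bureau_days_credit_mean=("DAYS_CREDIT", "mean"))
                    .reset_index())
# A left join keeps every application; applicants absent from bureau get NaN
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged.shape)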
The implementation of the most effective predictive model will empower Home Credit to make informed lending decisions, reduce the likelihood of unpaid loans, and expand financial services to individuals with limited access to traditional banking, thereby promoting financial inclusion for underserved communities. The performance of our models in predicting default probabilities will be rigorously evaluated using key metrics such as ROC AUC and F1 Score. Additionally, we will assess both public and private scores to gain a comprehensive understanding of our model's efficacy.
Numerical features: 107; categorical features: 16; three MLP models were built in this phase.
Data leakage occurs when a model is trained on information that will not be available during the prediction phase, resulting in artificially inflated performance metrics. To prevent data leakage, the dataset should be split into training and testing sets before any data preprocessing is performed: missing values are imputed and standardization is fitted on the training set only, and the testing set is then transformed with those same fitted statistics. This ensures that the model's measured performance accurately reflects its real-world capabilities. A minimal sketch of this ordering follows.
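A leakage-safe preprocessing order with scikit-learn (the toy arrays below stand in for the merged feature matrix and TARGET):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # stand-in for the engineered features
y = rng.integers(0, 2, size=200)       # stand-in for TARGET
# Split FIRST, before any statistics are computed from the data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
prep = Pipeline([("impute", SimpleImputer(strategy="median")),
                 ("scale", StandardScaler())])
X_tr_prep = prep.fit_transform(X_tr)   # fit imputation/scaling on train only
X_te_prep = prep.transform(X_te)       # reuse the fitted statistics on test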
Our machine learning pipelines adhere to best practices and avoid the cardinal sins of machine learning:
Overfitting Prevention: We split our dataset into training and testing sets to prevent overfitting. The model is trained on the training set and evaluated on the unseen test set. Similar accuracy on both sets indicates that the model is not overfitting.
Convergence Monitoring: We monitor training progress using the TensorBoard loss curves and avoid arbitrarily increasing epochs; we only train longer when the loss curve indicates that the run has not yet converged (see the logging sketch after this list).
Balanced Dataset: We ensure a balanced dataset to accurately evaluate model performance using metrics like accuracy and ROC_AUC.
Accurate Labels: We employ accurate labels in the training dataset to ensure the model learns from reliable information.
These measures safeguard against common pitfalls and ensure the effectiveness of our machine learning models.
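For the convergence-monitoring point, the sketch below shows the kind of per-epoch scalar logging behind those TensorBoard graphs (a toy regression loop with a hypothetical run directory; the notebook's train_and_test receives a SummaryWriter for the same purpose):
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
writer_demo = SummaryWriter(log_dir="runs/demo")   # hypothetical run directory
toy_model = nn.Linear(10, 1)
opt = torch.optim.Adam(toy_model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(toy_model(x), y)
    loss.backward()
    opt.step()
    # One point per epoch; a flat or oscillating curve in TensorBoard tells us
    # whether more epochs or a smaller learning rate is warranted
    writer_demo.add_scalar("loss/train", loss.item(), epoch)
writer_demo.close()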
All three MLP models are trained with the binary cross-entropy (CXE) loss, defined below.
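As a quick self-contained check with toy probabilities and labels (made-up values, not project data), torch.nn.BCELoss matches the formula that follows:
import torch
import torch.nn as nn
p = torch.tensor([0.9, 0.2, 0.7])  # predicted probabilities (toy values)
y = torch.tensor([1.0, 0.0, 1.0])  # ground-truth labels (toy values)
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = nn.BCELoss()(p, y)
print(manual.item(), builtin.item())  # both print ~0.2284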
$$ CXE = -\frac{1}{m}\sum\limits_{i=1}^{m} \left( y_i \cdot \log(p_i) + (1-y_i)\cdot \log(1-p_i) \right) $$
In Phase 4, three models were tested:
Simple MLP:
Experiment 2: Selected features after x>0 from Phase 3 findings
Enhanced MLP (Model 2):
Experiment 4: Experiment 3 with adjusted learning rate and epochs
Deep Wide Selected (Model 3):
In total, 8 experiments were conducted in this phase.
The table provided contains the results of the experiments conducted on the dataset with the various models and hyperparameters. The purpose of these experiments was to compare the models' performance and determine the best-performing one.
One important factor that emerged from these experiments was the role of feature selection in determining model performance. In particular, Models 1 and 2, which were trained on all available features, did not perform as well as Model 3, which used selected features. This suggests that feature selection is an important step in the machine learning pipeline, as it can help reduce overfitting and improve model performance (a sketch of this kind of selection follows).
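Below is a minimal sketch of importance-based selection consistent with the "x > 0" rule mentioned above; the random-forest estimator and the reuse of `X_tr_num`/`X_te_num` from the earlier preprocessing sketch are assumptions, not the exact Phase 3 procedure:

```python
# Importance-based feature selection sketch: keep only features whose
# tree-ensemble importance is strictly positive (the "x > 0" rule).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_tr_num, y_tr)

keep = rf.feature_importances_ > 0   # boolean mask over feature columns
X_tr_sel = X_tr_num[:, keep]
X_te_sel = X_te_num[:, keep]
print(f"kept {keep.sum()} of {keep.size} features")
```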
Another key finding was that hyperparameter tuning can also have a significant impact on model performance. Model 2 Enhanced 2, for example, outperformed the other models on test F1 score, suggesting that the changes to its architecture and hyperparameters yielded better overall performance. Model 3 Hyper Parameter Tuning also produced a slightly better test AUC than Model 3 Deep Wide Selected, indicating that even small hyperparameter changes can improve performance.
However, it is important to note that Model 2 Enhanced All did not perform well on test accuracy, suggesting that overfitting may have been a problem. This highlights the importance of ensuring that models are not too complex or too tightly fit to the training data, as this can negatively impact their performance on new data.
The enhanced MLP (Model 2), which has a training accuracy of 0.7466, test accuracy of 0.6848, training AUC of 0.7466, and test AUC of 0.6848, exhibits a more balanced performance across training and test datasets. Similarly, the F1 scores are 0.7486 and 0.6886 for training and test, respectively. This model has higher accuracy, AUC, and F1 scores compared to other models, indicating that it is able to generalize well to unseen data without overfitting or underfitting.
Another promising candidate is Model 3 (Deep Wide Selected), with a training accuracy of 0.7538, test accuracy of 0.6791, training AUC of 0.7538, and test AUC of 0.6791. The F1 scores for training and test are 0.7595 and 0.6873, respectively. This model also demonstrates a good balance between avoiding overfitting and underfitting while maintaining good performance across different evaluation metrics.
In conclusion, the enhanced MLP (Model 2) and Model 3 (Deep Wide Selected) appear to be the most promising candidates for this problem. Both strike a balance between overfitting and underfitting while performing well across the evaluation metrics, and further tuning and optimization could potentially lead to even better results.
Ultimately, the goal of this project was to use historical data to forecast the probability that Home Credit clients would default. With engineered features, we found that machine learning models can predict default risk reliably. In Phase 4, we experimented with Multi-Layer Perceptron (MLP) models and found that the Enhanced MLP and Deep Wide MLP performed strongly, with test accuracies and test F1 scores of about 0.68.
Our project highlights how crucial feature engineering and hyperparameter tuning are to maximizing model performance. Future work could explore alternative model architectures, regularization techniques, and hyperparameter settings; refine feature selection; expand the training data; and apply advanced ensemble approaches to further improve lending decisions.
References:
- Predict Loan Repayment with Automated Feature Engineering via the Featuretools library: https://github.com/Featuretools/predict-loan-repayment/blob/master/Automated%20Loan%20Repayment.ipynb
- A Guide to Automated Feature Engineering with Featuretools in Python: https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/
- Feature engineering paper (DSAA 2015): https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf
- Automated Categorical Data Analysis using CatBoost: https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/